A NOVEL OPTIMIZATION ALGORITHM BASED ON REINFORCEMENT LEARNING

Janusz A. Starzyk (Ohio University, School of Electrical Engineering and Computer Science, U.S.A.), Yinyin Liu (Ohio University, School of Electrical Engineering and Computer Science, U.S.A.), Sebastian Batog (Silesian University of Technology, Institute of Computer Science, Poland)
Abstract In this chapter, an efficient optimization algorithm is presented for problems with hard-to-evaluate objective functions. It uses the reinforcement learning principle to determine the moves of a search particle in the optimization process. A model of successful actions is built, and future actions are based on past experience. The step increment combines exploitation of the known search path with exploration for an improved search direction. The algorithm does not require any prior knowledge of the objective function, nor does it impose any special requirements on such a function. It is simple, intuitive, and easy to implement and tune. The optimization algorithm was tested using several multi-variable functions and compared with other widely used random search optimization algorithms. Furthermore, the training of a multi-layer perceptron, understood as finding a set of optimized weights, is treated as an optimization problem, and the optimized multi-layer perceptron is applied to classification of the Iris database. Finally, the algorithm is used in image recognition to find a familiar object with retina sampling and micro-saccades.
1 Introduction

Optimization is a process of finding the maximum or the minimum function value within given constraints by changing the values of its multiple variables. It can be
essential for solving complex engineering problems in such areas as computer science, aerospace, and machine intelligence applications. When the analytical relation between the variables and the objective function value is explicitly known, analytical methods such as Lagrange multiplier methods [1], interior point methods [18], Newton methods [30], or gradient descent methods [25] can be applied. However, in many practical applications, analytical methods do not apply. This happens when the objective functions are unknown, when the relations between the variables and the function value are not given or are difficult to find, when the functions are known but their derivatives are not applicable, or when the optimum value of the function cannot be verified. In these cases, iterative search processes are required to find the function optimum.

Direct search algorithms [10] form a set of optimization methods that do not require derivatives and do not approximate either the objective functions or their derivatives. These algorithms find locations with better function values by following a search strategy; they only need to compare the objective function values in successive iterative steps to make the move decision. Within the category of direct search, distinctions can be made among three classes: pattern search methods [28], simplex methods [6], and methods with adaptive sets of search directions [23]. In pattern search methods, the variables of the function are varied in steps of predetermined magnitude, or the step sizes are all reduced by the same factor when no improvement is found [15]. Simplex methods construct a simplex in ℜ^N using N+1 points and use the simplex to drive the search for the optimum. The methods with adaptive sets of search directions, proposed by Rosenbrock [23] and Powell [21], construct conjugate directions using information about the curvature of the objective function gathered during the search.

In order to avoid local minima, random search methods were developed that utilize randomness in setting the initial search points and other search parameters, such as the search direction or the step size. In Optimized Step-Size Random Search (OSSRS) [24], the step size is determined by fitting a quadratic function to the optimized function values in each of the random directions; the random direction is generated from a normal distribution with a given mean and standard deviation. Monte-Carlo optimization methods adopt randomness in the search process to create the possibility of escaping from local minima. Simulated Annealing (SA) [13] is one typical kind of Monte-Carlo algorithm. It exploits the analogy between the search for a minimum in an optimization problem and the annealing process in which a metal cools and stabilizes into a minimum energy crystalline structure. It accepts a move to a new position with a worse function value with a probability controlled by the "temperature" parameter, and this probability decreases along the "cooling process". SA can deal with highly nonlinear, chaotic problems provided that the cooling schedule and other parameters are carefully tuned.

Particle Swarm Optimization (PSO) [11] is a population-based evolutionary computational algorithm. It exploits the cooperation within the solution population instead of the competition among its members. At each iteration in PSO, a group of search particles make moves in a mutually coordinated fashion.
The step size of a particle is a function of both the best solution found by that particle and the best solution found so far by all the particles in the group. The use of a population of search particles
and the cooperation among them enable the algorithm to evaluate function values over a wide range of variables in the input space and to find the optimum position. Each particle remembers only its own best solution and the global best solution of the group to determine its step sizes.

Generally, during the course of the search, a sequence of decisions on the step sizes is made and a number of function values are obtained in these optimization methods. In order to implement an efficient search for the optimum point, it is desirable that such historical information be utilized in the optimization process. Reinforcement Learning (RL) [27] is a type of learning process that maximizes certain numerical values by combining exploration and exploitation and by using rewards as learning stimuli. In the reinforcement learning problem, the learning agent performs experiments to interact with the unknown environment and accumulates knowledge during this process. It is a trial-and-error exploratory process with the objective of finding the optimum action. During this process, an agent can learn to build a model of the environment to guide its search, so that the agent can predict the environment's response to its actions and choose the actions most useful for its objectives based on its past exploring experience.

Surrogate-based optimization refers to the idea of speeding up the optimization process by using surrogates for the objective and constraint functions. The surrogates also allow for the optimization of problems with non-smooth or noisy responses, and can provide insight into the nature of the design space. The max-min SAGA approach [20] searches for designs that have the best worst-case performance in the presence of parameter uncertainty. By leveraging a trust-region approach which uses computationally cheap surrogate models, it allows for the possibility of achieving robust design solutions on a limited computational budget. Another example of surrogate-based optimization is the surrogate-assisted Hooke-Jeeves algorithm (SAHJA) [8], which can be used as a local component of a global optimization algorithm. This local searcher uses the Hooke-Jeeves method, which explores the input space intelligently by employing both the real fitness function and an approximated one.

The idea of building knowledge about an unknown problem through exploration can be applied to optimization problems. To find the optimum of an unknown multivariable function, an efficient search procedure can be performed using only the historical information from conducted experiments to expedite the search. In this chapter, a novel and efficient optimization algorithm based on reinforcement learning is presented. This algorithm uses simple search operators and will be called reinforcement learning optimization (RLO) in the later sections. It does not require any prior knowledge of the objective function or the function's gradient, nor does it impose any special requirements on the objective function. In addition, it is conceptually very simple and easy to implement. This approach to optimization is compatible with neural networks and learning through interaction, and thus it is useful for systems of embodied intelligence and motivated learning as presented in [26]. The following section presents the RLO method and illustrates it with several machine learning applications.
2 Optimization Algorithm

2.1 Basic search procedure

An N-variable optimization objective function V = f(p_1, p_2, ..., p_N) (p_1, p_2, ..., p_N, V ∈ ℜ^1) can have several local minima and several global minima V_opt1, ..., V_optN. It is desired that the search process, initiated from a random point, finds a path to a global optimum point. Unlike particle swarm optimization [11], this process can be performed with a single search particle that learns how to find its way to the optimum point. It does not require cooperation among a group of particles, although implementing cooperation among several search particles may further enhance the search process in this method.

At each point of the search, the search particle tries to find a new location with a better value within a searching range around it and then determines the direction and the step size for the next move. It attempts to reach the optimum through a weighted random search along each variable (coordinate). The step size of the search in each variable is randomly generated with its own probability density function. These functions are gradually learned during the search process, and it is expected that at the later stages of the search the probability density functions are well approximated for each variable. In this way, a stochastically randomized path from the start point to the minimum point of the function is learned.

The step sizes of all the coordinates determine the center of the new searching area, and the standard deviations of the probability functions determine the size of the new searching area around the center. In the new searching area, P_S locations are randomly generated. If there is a location p' with a better value than the current one, the search operator moves to it. From this new location, new step sizes and a new searching range are determined, so that the search for the optimum continues. If the current searching area contains no point with a better value that the search particle could move to, other sets of random points are generated until either an improvement is obtained or several, say M, trials fail. In the latter case, the searching area size and the step sizes are modified in order to find a better function value. If no better value is found after K trials of generating different searching areas, or the proposed stopping criterion is met, we can claim that the optimum point has been found. The algorithm of searching for the minimum point is schematically shown in Fig. 1.
Fig. 1 The algorithm of RLO searching for the minimum point
2.2 Extracting historical information by weighted optimized approximation

After the search particle makes a sequence of n moves, the step sizes of these moves dp^i_t (t = 1, 2, ..., n; i = 1, 2, ..., N) are available for learning. These historical steps have made the search particle move towards better values of the objective function and, hopefully, get closer to the optimum location. In this sense, these steps are the successful actions of the trial. It is proposed that the successful actions which result in a positive reinforcement (the step sizes of each coordinate) follow a function of the iterative step t, as in (1), where dp^i represents the step sizes on the i-th coordinate and f_i(t) is the function for coordinate i:

dp^i = f_i(t)   (i = 1, 2, ..., N).   (1)
These unknown functions f_i(t) can be approximated, for example, by polynomials obtained through a least-squares fit (LSF):

\begin{bmatrix} dp^i_1 \\ \vdots \\ dp^i_n \end{bmatrix} =
\begin{bmatrix} 1 & t_1 & t_1^2 & \dots & t_1^B \\ \vdots & \vdots & \vdots & & \vdots \\ 1 & t_n & t_n^2 & \dots & t_n^B \end{bmatrix}
\begin{bmatrix} a_0 \\ a_1 \\ a_2 \\ \vdots \\ a_B \end{bmatrix}   (2)

In (2), the step sizes dp^i_1 to dp^i_n are the step sizes on a given coordinate during n steps and are fitted as unknown function values using polynomials of order B. The polynomial coefficients a_0 to a_B can be obtained and will represent the function f_i(t) used to estimate dp^i,
dp^i = \sum_{j=0}^{B} a_j t^j.   (3)
Using polynomials for function approximation is easy and efficient. However, considering the characteristics of optimization problems, we have two concerns. First, in order to generate a good approximation while avoiding overfitting, a proper order of the polynomial must be selected. In the optimized approximation algorithm (OAA) presented in [17], the goodness of fit is determined by the so-called signal-to-noise ratio figure (SNRF). Based on the SNRF, an approximation stopping criterion was developed. Using a certain set of basis functions for the approximation, the error signal, computed as the difference between the approximated function and the sampled data, can be examined by the SNRF to determine how much useful information it contains. The SNRF for the error signal, denoted as SNRF_e, is compared to the pre-calculated SNRF for white Gaussian noise (WGN), denoted as SNRF_WGN. If SNRF_e is higher than SNRF_WGN, more basis functions should be used to improve the learning. Otherwise, the error signal shows the characteristics of WGN and should not be reduced any further in order to avoid fitting the noise; the obtained approximation is then the optimum one. Such a process can be applied to determine the proper order of the polynomial.

The second concern is that, in the case of reinforcement learning, the knowledge about the originally unknown environment is gradually accumulated throughout the learning process. The information that the learning system obtains at the beginning of the process is mostly based on initially random exploration. During the process of interaction, the learning system collects the historical information and builds the model of the environment. The model can be updated after each step of interaction. The decisions made at the later stages of the interaction are based more on the built model than on random exploration. This means that the recent results are more important and should be weighted more heavily than the old ones. For example, the applied weights can increase exponentially from the initial trials to the recent ones, as
w_t = \frac{\alpha^t}{n}   (t = 1, 2, ..., n),   (4)

where we define \alpha^n = n. As a result, the weights lie in the interval (0, 1], and the weight is 1 for the most recent sample. Applying the weights in the LSF, we obtain the weighted least-squares fit (WLSF), expressed as follows:

\begin{bmatrix} dp_1 w_1 \\ \vdots \\ dp_n w_n \end{bmatrix} =
\begin{bmatrix} 1 \cdot w_1 & t_1 w_1 & t_1^2 w_1 & \dots & t_1^B w_1 \\ \vdots & \vdots & \vdots & & \vdots \\ 1 \cdot w_n & t_n w_n & t_n^2 w_n & \dots & t_n^B w_n \end{bmatrix}
\begin{bmatrix} a_0 \\ a_1 \\ a_2 \\ \vdots \\ a_B \end{bmatrix}   (5)
Due to the weights applied to the given samples, the approximated function will fit to the recent data better than to the old ones.
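As an illustration only (the experiments reported later in this chapter were implemented in MATLAB), the weighted fit of (2)-(5) could be sketched in Python/NumPy as follows. The helper names weighted_poly_fit and eval_poly are assumptions introduced here and are reused in the later sketches; this is not the authors' implementation.

import numpy as np

def weighted_poly_fit(dp, order):
    """Weighted least-squares fit (WLSF) of historical step sizes dp[t], t = 1..n.
    Recent samples get larger weights w_t = alpha**t / n with alpha = n**(1/n),
    so that w_n = 1, as in Eqs. (4)-(5). Returns polynomial coefficients a_0..a_B."""
    n = len(dp)
    t = np.arange(1, n + 1, dtype=float)
    alpha = n ** (1.0 / n)
    w = alpha ** t / n                              # weights in (0, 1], w_n = 1 (Eq. (4))
    X = np.vander(t, order + 1, increasing=True)    # design matrix of Eq. (2)
    a, *_ = np.linalg.lstsq(X * w[:, None], np.asarray(dp, dtype=float) * w, rcond=None)
    return a

def eval_poly(a, t):
    """Evaluate the fitted step-size model dp = sum_j a_j * t**j (Eq. (3))."""
    return sum(a_j * t ** j for j, a_j in enumerate(a))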
Utilizing the concept of OAA to obtain an optimized WLSF, the SNRF for the error signal or for WGN has to be estimated considering the sample weights. In the original OAA for a one-dimensional problem [17], the SNRF for the error signal was calculated as

SNRF_e = \frac{C(e_j, e_{j-1})}{C(e_j, e_j) - C(e_j, e_{j-1})}   (6)
where C represents the correlation calculation, e_j represents the error signal (j = 1, 2, ..., n), and e_{j-1} represents the (circularly) shifted version of e_j. The characteristics of the SNRF for WGN, expressed through the average value and the standard deviation, can be estimated from a Monte-Carlo simulation as (see the derivation in [17])
\mu_{SNRF_WGN}(n) = 0   (7)

\sigma_{SNRF_WGN}(n) = \frac{1}{\sqrt{n}}.   (8)
Then the threshold, which determines whether SNRF_e shows the characteristics of SNRF_WGN and the fitting error should not be reduced any further, is

th_{SNRF_WGN}(n) = \mu_{SNRF_WGN}(n) + 1.7 \sigma_{SNRF_WGN}(n).   (9)
For the weighted approximation, the SNRF for the error signal is calculated as

SNRF_e = \frac{C(e_j \cdot w_j, e_{j-1} \cdot w_{j-1})}{C(e_j \cdot w_j, e_j \cdot w_j) - C(e_j \cdot w_j, e_{j-1} \cdot w_{j-1})}.   (10)
In Fig. 2(a), \sigma_{SNRF_WGN}(n) obtained from a 200-run Monte-Carlo simulation is shown on a logarithmic scale. It can be estimated as

\sigma_{SNRF_WGN}(n) = \frac{2}{\sqrt{n}}.   (11)
It is found that the 5% significance level can be approximated by the average value plus 1.5 standard deviations for an arbitrary n. As an example, Fig. 2(b) illustrates the histogram of SNRF_WGN for 2^16 samples. The threshold in this case of a dataset with 2^16 samples can be calculated as \mu + 1.5\sigma = 0 + 1.5 \times 0.0078 = 0.0117.

Fig. 2 Characteristic of SNRF for WGN in weighted approximation

Therefore, to obtain an optimized weighted approximation in the one-dimensional case, the following algorithm is performed.

Optimized weighted approximation algorithm (OWAA):

Step (1). Assume that an unknown function F, with input space t ⊂ ℜ^1, is described by n training samples dp_t (t = 1, 2, ..., n).

Step (2). The signal detection threshold is pre-calculated for the given number of samples n based on SNRF_WGN. For a one-dimensional problem,
th_{SNRF_WGN}(n) = \frac{1.5 \cdot 2}{\sqrt{n}}.

Step (3). Take a set of basis functions, for example, polynomials of order from 0 up to B.

Step (4). Use these B+1 basis functions to obtain the approximated function,

\hat{dp}_t = \sum_{l=1}^{B+1} f_l(x_t)   (t = 1, 2, ..., n).   (12)
Step (5). Calculate the approximation error signal,

e_t = dp_t - \hat{dp}_t   (t = 1, 2, ..., n).   (13)
Step (6). Determine the SNRF of the error signal using (10).

Step (7). Compare SNRF_e with th_{SNRF_WGN}. If SNRF_e is equal to or less than th_{SNRF_WGN}, or if B exceeds the number of samples, stop the procedure; in such a case \hat{F} is the optimized approximation. Otherwise, add one basis function (in this example, increase the order of the approximating polynomial to B+1) and repeat Steps (4)-(7).

Using the above algorithm, the proper order of the polynomial is determined so as to extract the useful information (but not the noise) from the historical data. In addition, the extracted information fits the recent results better than the old ones.
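A rough sketch of this order-selection loop, again in Python/NumPy and reusing weighted_poly_fit and eval_poly from the earlier sketch, might look as follows; the function names and the default maximum order are assumptions made here for illustration.

import numpy as np

def snrf(e, w=None):
    """SNRF of an error signal e, Eq. (6); when weights w are given it follows Eq. (10)."""
    e = np.asarray(e, dtype=float)
    if w is not None:
        e = e * w
    e_shift = np.roll(e, 1)                     # circularly shifted error signal
    c_auto = np.dot(e, e)                       # C(e_j, e_j)
    c_cross = np.dot(e, e_shift)                # C(e_j, e_{j-1})
    return c_cross / (c_auto - c_cross)

def owaa_fit(dp, max_order=10):
    """Optimized weighted approximation: grow the polynomial order until the weighted
    SNRF of the residual drops below the WGN threshold 1.5 * 2 / sqrt(n) of Step (2)."""
    n = len(dp)
    t = np.arange(1, n + 1, dtype=float)
    alpha = n ** (1.0 / n)
    w = alpha ** t / n
    threshold = 1.5 * 2.0 / np.sqrt(n)
    a = weighted_poly_fit(dp, 0)
    for order in range(max_order + 1):
        a = weighted_poly_fit(dp, order)
        residual = np.asarray(dp, dtype=float) - np.array([eval_poly(a, tk) for tk in t])
        if order >= n - 1 or snrf(residual, w) <= threshold:
            break                               # Step (7): error looks like WGN, stop
    return a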
Example. We illustrate the process of learning historical information by considering a 2-variable function. The function V(p_1, p_2) = p_2^2 sin(1.5 p_2) + 2 p_1^2 sin(2 p_1) + p_1 sin(2 p_2) has several local minima, but only one global minimum, as shown in Fig. 3. In the process of interaction, the historical information is collected after each iteration. The historical step sizes of the 2 coordinates are approximated separately, as shown in Fig. 4(a) and (b). The step sizes of the two coordinates are approximated by quadratic polynomials whose order is determined by OWAA and whose coefficients are obtained using WLSF. In Fig. 4, the approximated functions are compared with quadratic polynomials whose coefficients are obtained from LSF. It can be observed that the function obtained using WLSF fits the data from the later iterations more closely than the function obtained using LSF.
Fig. 3 A 2-variable function V(p_1, p_2)
Fig. 4 Function approximation for historical step sizes
The level of the approximation error signal e_t for the step sizes of a given coordinate dp^i, which is the difference between the observed sampled data and the approximated function, can be measured by its standard deviation, as shown in (14):

\sigma_{p_i} = \sqrt{\frac{1}{n} \sum_{t=1}^{n} (e_t - \bar{e})^2}   (14)

This standard deviation will be called the approximation deviation in the following discussion. It represents the maximum deviation of the location of the search particle from the prediction given by the approximated function in the unknown function optimization problem.
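Under the same assumptions as the earlier sketches, the approximation deviation of (14) for one coordinate could be computed, for instance, as:

import numpy as np

def approximation_deviation(dp, a):
    """Approximation deviation of Eq. (14): standard deviation of the residuals between
    the observed step sizes dp and the fitted step-size model (eval_poly, Eq. (3))."""
    t = np.arange(1, len(dp) + 1, dtype=float)
    residual = np.asarray(dp, dtype=float) - np.array([eval_poly(a, tk) for tk in t])
    return float(np.std(residual))              # np.std subtracts the mean, as in (14)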
2.3 Predicting new step sizes

The approximated functions are used to determine the step sizes for the next iteration, as shown in (15) and illustrated in Fig. 5 along with the approximated functions:

dp^i_{t+1} = f_i(t+1)   (15)
Fig. 5 Prediction of the step sizes for the next iteration
The step size functions constitute the model of the environment that the learning system builds during the process of interaction, based on historical information. The future step size determined by such a model can be employed as exploitation of the existing model. However, a model built during the learning process cannot be treated as exact. Besides exploitation, which makes the best use of the obtained model, a certain degree of exploration is desired in order to improve the model and discover better solutions. The exploration can be implemented using a Gaussian random generator (GRG). As a good trade-off between exploitation and exploration is needed, we propose to use
the step sizes for the next iteration determined by the step size functions as the mean values, and the approximation deviations as the standard deviations, of the random generator. The Gaussian random generators give several random choices of the step sizes. Effectively, the determined step sizes of the multiple coordinates generate the center of the searching area, and the size of the searching range is determined by the standard deviations of the GRG for the coordinates. The multiple random values generated by the GRG for each coordinate effectively create multiple locations within the searching area. The objective function values at these locations are compared, and the location with the best value, called the current best location, is chosen as the place from which the search particle continues searching in the next iteration. Therefore, the actual step sizes are calculated as the distance from the "previous best location" to the "current best location". The actual step sizes are added to the historical step sizes and used to update the model of the unknown environment.

Several locations of the search particle in this approach are illustrated in Fig. 6 using a 2-variable function as an example. The search particle was located at the previous best location p_prev(p1_prev, p2_prev), and the previous step size dp_prev(dp1_prev, dp2_prev) was obtained when the current best location p(p1, p2) was found as the best location in the previous searching area (an area containing p(p1, p2), not shown in the figure). At the current best position p(p1, p2), using the environment model built from the historical step sizes, the current step size is determined to be dp1 on coordinate 1 and dp2 on coordinate 2, so that the center of the new searching area is determined. The approximation deviations of the two coordinates, σ_p1 and σ_p2, give the size of the searching range. Within this searching range, several random points are generated in order to find a better position to which the search operator will move.
Fig. 6 Step sizes and searching area
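The candidate-generation step described above could be sketched as follows; the function names, the default number of candidate points, and the use of NumPy's random generator are illustrative assumptions.

import numpy as np

def propose_candidates(p_best, dp_pred, sigma, n_points=10, rng=None):
    """Generate candidate locations around the predicted next position.
    p_best  -- current best location (N coordinates)
    dp_pred -- step sizes predicted by the step-size model, one per coordinate
    sigma   -- approximation deviations, used as standard deviations of the GRG"""
    rng = np.random.default_rng() if rng is None else rng
    center = np.asarray(p_best, dtype=float) + np.asarray(dp_pred, dtype=float)
    return center + rng.normal(0.0, sigma, size=(n_points, center.size))

def best_improving(candidates, f, v_best):
    """Return (location, value) of the best candidate if it improves on v_best,
    otherwise (None, v_best) so the caller can retry or adjust the searching area."""
    values = np.array([f(c) for c in candidates])
    idx = int(values.argmin())
    return (candidates[idx], values[idx]) if values[idx] < v_best else (None, v_best)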
2.4 Stopping criterion

The search particle moves from every "previous best location" to the "current best location", and the step sizes actually taken are used for model learning. As new step sizes are generated, the search particle is expected to move to locations with better objective function values. In the proposed algorithm, the search particle makes a move only when a location with a better function value is found. However, if all the points generated in the current searching range have no better function values than the current best value, the search particle does not move, and the GRG repeats generating groups of particle locations for several trials. If no better location is found after M trials, we suspect that the current searching range is too small or the current step size is too large, which makes us miss the locations with better function values. In such a case, we should enlarge the size of the searching area and reduce the step size, as in (16),
\sigma_{p_i} = \alpha \sigma_{p_i},   dp^i = \varepsilon dp^i   (i = 1, 2, ..., N),   (16)
where \alpha > 1 and \varepsilon < 1. If this new search is still not successful, the searching range and the step size keep changing until some points with better function values are found. If, at a certain step of the search process, the current step size has to be reduced so much that the search particle can hardly move anywhere in order to find a new location with a better function value, it indicates that the optimum point has been reached. The stopping criterion can be defined by the current step size becoming \beta times smaller than the previous step size:

dp < \beta \, dp_prev   (0 < \beta < 1, \beta is usually small).   (17)
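The adjustment of (16) and the stopping test of (17) could be sketched as below; interpreting the step-size comparison of (17) through vector norms, as well as the helper names, are assumptions. The default values α = 1.1, ε = 0.9 and β = 0.005 are those quoted in Section 3.1.1.

import numpy as np

def adjust_search(sigma, dp, alpha=1.1, eps=0.9):
    """After M failed trials: enlarge the searching area and shrink the step, Eq. (16)."""
    return alpha * np.asarray(sigma, dtype=float), eps * np.asarray(dp, dtype=float)

def should_stop(dp, dp_prev, beta=0.005):
    """Stopping criterion of Eq. (17): the current step is beta times smaller than the
    previous one (vector norms used here as an interpretation of the scalar comparison)."""
    return np.linalg.norm(dp) < beta * np.linalg.norm(dp_prev)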
2.5 Optimization algorithm

Based on the previous discussion, the proposed optimization algorithm (RLO) can be described as follows.

(a). The procedure starts from a random point of the objective function with N variables, V = f(p_1, p_2, ..., p_N). It will try to make a series of moves to get closer to the global optimum point.

(b). To move from the current location, the step size dp^i (i = 1, 2, ..., N) and the standard deviation \sigma_{p_i} (i = 1, 2, ..., N) for each coordinate are generated from a uniform probability distribution.

(c). The step sizes dp^i (i = 1, 2, ..., N) determine the center of the searching area. The deviations of all the coordinates \sigma_{p_i} (i = 1, 2, ..., N) determine the size of the searching area. P_S points in this range are randomly chosen from a Gaussian distribution using dp^i as the mean values and \sigma_{p_i} as the standard deviations.
(d). The objective function values are evaluated at these new points and compared with the value at the current location.

(e). If the new points generated in Step (c) have no better values than the current position, Step (c) is repeated for up to M trials, until a point with a better function value is found.

(f). If the search fails after M trials, enlarge the size of the searching area and reduce the step size, as in (16).

(g). If the search with the searching area size and step sizes updated in Step (f) is not successful, the range and the step sizes keep being adjusted until some points with better values are found. If instead the current step sizes become much smaller than the previous step sizes, as in (17), or the function value changes by less than a pre-specified threshold, the algorithm terminates; this indicates that the optimum point has been reached.

(h). Move the search particle to the point p(p_1, p_2) with the best function value V_b (a local minimum or maximum, depending on the optimization objective). The distance between the previous best point p_prev(p1_prev, p2_prev) and the current best point p(p_1, p_2) gives the actual step sizes dp^i (i = 1, 2, ..., N). Collect the historical information of the step sizes taken during the search process.

(i). Approximate the step sizes as a function of the iterative steps using the weighted least-squares fit, as in (5). The proper maximum order of the basis functions is determined using the SNRF described in Section 2.2 to avoid overfitting.

(j). Use the modeled function to determine the step sizes dp^i (i = 1, 2, ..., N) for the next iteration step. The difference between the approximated step sizes and the actual step sizes gives the approximation deviations \sigma_{p_i} (i = 1, 2, ..., N). Repeat Steps (c) to (j).

In general, the optimization algorithm based on reinforcement learning builds a model of successful moves for a given objective function. The model is built from historical successful actions and is used to determine new actions. The algorithm combines exploitation and exploration of the search using random generators. The optimization algorithm does not require any prior knowledge of the objective function or its derivatives, nor does it put any special requirements on the objective function. The search operator is conceptually very simple and intuitive. In the following section, the algorithm is verified using several experiments.
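Putting the pieces together, steps (a)-(j) could be sketched as one loop. This is a simplified reading of the procedure, not the authors' code, and it reuses the helper functions from the earlier sketches (propose_candidates, best_improving, adjust_search, should_stop, owaa_fit, eval_poly, approximation_deviation). The initial ranges for the step sizes and deviations, the requirement of four collected steps before refitting the model, and the iteration cap are assumptions.

import numpy as np

def rlo_minimize(f, p0, n_points=10, M=20, max_iter=200,
                 alpha=1.1, eps=0.9, beta=0.005, rng=None):
    """Sketch of the RLO loop of Section 2.5, steps (a)-(j)."""
    rng = np.random.default_rng() if rng is None else rng
    p = np.asarray(p0, dtype=float)
    v = f(p)
    dp = rng.uniform(-1.0, 1.0, size=p.size)       # step (b): initial step sizes (assumed range)
    sigma = rng.uniform(0.1, 1.0, size=p.size)     # step (b): initial deviations (assumed range)
    history = []                                    # actual step sizes taken, step (h)
    for _ in range(max_iter):
        moved = False
        for _ in range(M):                          # steps (c)-(e): sample and compare candidates
            candidates = propose_candidates(p, dp, sigma, n_points, rng)
            new_p, new_v = best_improving(candidates, f, v)
            if new_p is not None:
                history.append(new_p - p)           # actual step size taken
                p, v, moved = new_p, new_v, True
                break
        if not moved:                               # steps (f)-(g): widen area, shrink step
            sigma, dp = adjust_search(sigma, dp, alpha, eps)
            if history and should_stop(dp, history[-1], beta):
                break                               # stopping criterion (17)
            continue
        if len(history) >= 4:                       # steps (i)-(j): refit the step-size model
            t_next = len(history) + 1
            for i in range(p.size):
                steps_i = [h[i] for h in history]
                a = owaa_fit(steps_i)
                dp[i] = eval_poly(a, t_next)
                sigma[i] = approximation_deviation(steps_i, a) + 1e-12
    return p, v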
3 Simulation and discussion

3.1 Finding the global minimum of a multi-variable function

3.1.1 A synthetic bivariate function

The synthetic bivariate function V(p_1, p_2) = p_2^2 sin(1.5 p_2) + 2 p_1^2 sin(2 p_1) + p_1 sin(2 p_2), used previously in the example in Section 2.2, is used as the objective function. This function has several local minima and one global minimum equal to -112.2586. The optimization algorithm starts at a random point and performs the search process looking for the optimum point (the minimum in this example). The number of random points P_S generated in the searching area in each step is 10. The scaling factors \alpha and \varepsilon in (16) are 1.1 and 0.9, and \beta in (17) is 0.005. One possible search path, from the start location to the final optimum location found by the RLO algorithm, is shown in Fig. 7; the global optimum is found in 13 iterative steps. The historical locations are shown in the figure as well. The historical step sizes taken during the search process are shown in Fig. 8 together with their WLSF approximation.
Fig. 7 Search path from start point to optimum
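For instance, the synthetic objective of this section could be defined and passed to the rlo_minimize sketch as follows; the starting range [-8, 8] for the random initial point is an assumption, since the chapter does not state the search domain.

import numpy as np

def synthetic_v(p):
    """Synthetic bivariate objective used in Sections 2.2 and 3.1.1."""
    p1, p2 = p
    return p2**2 * np.sin(1.5 * p2) + 2 * p1**2 * np.sin(2 * p1) + p1 * np.sin(2 * p2)

# Hypothetical usage with the parameter values quoted in the text
p_opt, v_opt = rlo_minimize(synthetic_v, p0=np.random.uniform(-8, 8, size=2),
                            n_points=10, alpha=1.1, eps=0.9, beta=0.005)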
An example of another search process, starting from a different random point, is shown in Fig. 9. The global optimum is found in 10 iterative steps. Table 1 shows the changes in the function values and the adjustment of the step sizes dp1 and dp2 for p_1 and p_2 in the successive search steps. Notice how the step size was initially reduced and then increased again once the algorithm started to follow a correct path towards the optimum.
Fig. 8 Step sizes taken during the search process
Fig. 9 Search path from start point to optimum
Table 1. Function values and step sizes in a searching process

Search step   Function value V(p1, p2)   Step size dp1   Step size dp2
 1                 1.4430                      —               —
 2               -34.8100                    2.9455          0.8606
 3               -61.4957                    0.3570         -1.7924
 4               -69.8342                   -0.0508         -0.7299
 5               -70.5394                   -0.0477         -0.3114
 6               -71.5813                   -0.1232          0.2015
 7              -109.0453                    0.0000          4.4358
 8              -110.8888                   -0.0281          0.3408
 9              -112.0104                    0.0495         -0.0531
10              -112.1666                    0.0438         -0.0772
Such a search process was performed for 300 random trials. The success rate of finding the global optimum is 93.78%. On average, it takes 5.9 steps and 4299 function evaluations to find the optimum in this problem. The same problems were tested with several other direct-search-based optimization algorithms, including SA [29], PSO [14] and OSSRS [2]. The success rates of finding the global optimum and the average numbers of function evaluations are compared in Tables 2, 3 and 4. All the simulations were performed on an Intel Core Duo 2.2 GHz based PC with 2 GB of RAM.

Table 2. Comparison of optimization performances on the synthetic function

                                             RLO       SA        PSO      OSSRS
Success rate of finding the global optimum   93.78%    29.08%    94.89%   52.21%
Number of function evaluations               4299      13118     4087     313
CPU time consumption [s]                     28.4      254.35    20.29    1.95
3.1.2 Six-hump camel back function

The classic 2D six-hump camel back function [5] has 6 local minima and 2 global minima. The function is given as

V(p_1, p_2) = (4 - 2.1 p_1^2 + \frac{p_1^4}{3}) p_1^2 + p_1 p_2 + (-4 + 4 p_2^2) p_2^2   (p_1 ∈ [-3, 3], p_2 ∈ [-2, 2]).

Within the specified bounded region, the function has 2 global minima equal to -1.0316. The optimization performances of the compared algorithms over 300 random trials are given in Table 3.

Table 3. Comparison of optimization performances on the six-hump camel back function

                                             RLO       SA        PSO      OSSRS
Success rate of finding the global optimum   80.33%    45.22%    86.44%   42.67%
Number of function evaluations               5016      8045.5    3971     256
CPU time consumption [s]                     33.60     151.86    20.35    1.63
3.1.3 Banana function

Rosenbrock's famous "banana function" [23], V(p_1, p_2) = 100 (p_2 - p_1^2)^2 + (1 - p_1)^2, has one global minimum, equal to 0, lying inside a narrow, curved valley. The optimization performances of the compared algorithms over 300 random trials are given in Table 4.
Table 4. Comparison of optimization performances on the banana function

                                             RLO       SA        PSO      OSSRS
Success rate of finding the global optimum   74.55%    3.33%     41%      88.89%
Number of function evaluations               48883.7   28412     4168     882.4
CPU time consumption [s]                     320.74    539.38    20.27    5.15
In these optimization problems, RLO demonstrates consistently satisfactory performance without particular tuning of its parameters, whereas the other methods show different levels of efficiency and capability in handling the various problems.
3.2 Optimization of weights in multi-layer perceptron training

The output of a multi-layer perceptron (MLP) can be viewed as the value of a function with the weights as the variables. Training the MLP, in the sense of finding optimal values of the weights to accomplish the learning task, can therefore be treated as an optimization problem. We take the Iris plant database [22] as a test case. The Iris database contains 3 classes, 5 numerical features and 150 samples. In order to classify the Iris samples, a 3-layered MLP with an input layer, a hidden layer and an output layer can be used. The size of the input layer is equal to the number of features, the size of the hidden layer is chosen to be 6, and, since the class IDs are numerical values equal to 1, 2 and 3, the size of the output layer is 1. The weight matrix between the input layer and the hidden layer contains 30 elements, and the one between the hidden layer and the output layer contains 6 elements. Overall, there are 36 weight elements (parameters) to be optimized.

In a typical trial, the optimization algorithm finds an optimal set of weights after only 3 iterations. In the testing stage, the outputs of the MLP are rounded to the nearest integers to indicate the predicted class IDs. Comparing the given class IDs with the class IDs predicted by the MLP in Fig. 10, 146 out of 150 Iris samples are correctly classified by this set of weights, and the percentage of correct classification is 97.3%. A single support vector machine (SVM) achieved a 96.73% classification rate [12]. In addition, an MLP with the same structure, trained by back-propagation (BP), achieved 96% on the Iris test case. The MLP and BP were implemented using the MATLAB neural network toolbox.

Fig. 10 RLO performance on neural network training on Iris problem
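As a hedged illustration of how the 36 weights can be exposed to the optimizer, the following sketch flattens them into a single vector and scores them with a sum-of-squared-errors objective; the tanh hidden activation, the linear output, and the error measure are assumptions not stated in the chapter, and X_iris and y_iris denote hypothetical feature and class-ID arrays.

import numpy as np

def mlp_output(weights, X, n_in=5, n_hidden=6):
    """Forward pass of the 5-6-1 MLP described above, with all 36 weights
    flattened into one vector; tanh hidden units are an assumption."""
    W1 = weights[:n_in * n_hidden].reshape(n_in, n_hidden)
    W2 = weights[n_in * n_hidden:].reshape(n_hidden, 1)
    return np.tanh(X @ W1) @ W2

def mlp_objective(weights, X, class_ids):
    """Objective to minimize: squared error between MLP outputs and numeric class IDs."""
    return float(np.sum((mlp_output(weights, X).ravel() - class_ids) ** 2))

# Hypothetical usage with the rlo_minimize sketch:
# w_opt, err = rlo_minimize(lambda w: mlp_objective(w, X_iris, y_iris),
#                           p0=np.random.randn(36))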
3.3 Micro-saccade optimization in active vision for machine intelligence

In the area of machine intelligence, active vision has become an interesting topic. Instead of taking in the whole scene captured by the camera and making sense of all its information, as in the conventional computer vision approach, an active vision agent
focuses on small parts of the scene and moves its fixation frequently. Humans and other animals use such quick movements of both eyes, called saccades [3], to focus on the interesting parts of a scene and to use their resources efficiently. The interesting parts are usually important features of the input, and with the important features extracted, the high-resolution scene is analyzed and recognized with a relatively small number of samples. In the saccade movement network (SMN) presented in [16], the original images are transformed into a set of low-resolution images after saccade movements and retina sampling. This set of images, representing the sampled features, is fed to a self-organizing winner-take-all classifier (SOWTAC) network for recognition.

To find interesting features of the input image and to direct the saccade movements, image segmentation, edge detection and basic morphology tools [4] are utilized. Fig. 11(a) shows a face image from [7] with 320×240 pixels. The interesting features found are shown in Fig. 11(b); the stars represent the centers of the four interesting features found on the face image, and the rectangles represent the feature boundaries. The retina sampling model [16] then places its fovea at the center of each interesting feature, so that these features can be extracted. In practice, the centers of the interesting features found by the image processing tools [4] are not guaranteed to be accurate, which affects the accuracy of the feature extraction and pattern recognition process. In order to help find the optimum sampling positions, the RLO algorithm can be used to direct the moves of the fovea of the retina and find the closest match between the obtained sample features and pre-stored reference sample features. These slight moves during fixation, made to find the optimum sampling positions, can be called micro-saccades in the active vision process, although the actual role of micro-saccades has been an unresolved debate topic for several decades [19].
Fig. 11 Face image and its interesting features in active vision [16]
Fig. 12 Image sampling by micro-saccade
Fig. 12(a) shows a group of ideal samples of important features in face recognition, and Fig. 12(b) shows the group of sampled features at the initial sampling positions. In the optimization process, the x-y coordinates need to be optimized so that the sampled images have the optimum similarity to the ideal images. The level of similarity can be measured by the sum of squared intensity differences [9]; in this metric, increased similarity corresponds to a decreased intensity difference. Such a problem can also be perceived as an image registration problem. The two-variable objective function V(x, y), the sum of squared intensity differences, needs to be minimized by the RLO algorithm. Note that the only available information is that V is a
function of the x and y coordinates. How the function is expressed and what its characteristics are is completely unknown, and the minimum value of the objective function is not known either. RLO is therefore a suitable algorithm for such an optimization problem. Fig. 12(c) shows the optimized sampled images obtained using RLO-directed micro-saccades. The optimized feature samples are closer to the ideal feature samples, which helps the processing of the face image. After the feature images are obtained through RLO-directed micro-saccades, these low-resolution images, instead of the entire high-resolution face image, are sent to the SOWTAC network for further processing or recognition.
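A minimal sketch of the objective V(x, y) as a sum of squared intensity differences is given below; the rectangular patch sampling stands in for the retina sampling model of [16] purely for illustration, and face_image, ref_patch and initial_fixation are hypothetical arrays.

import numpy as np

def ssd_objective(xy, image, reference):
    """Sum of squared intensity differences [9] between the patch sampled at (x, y)
    and a reference feature sample of the same size."""
    h, w = reference.shape
    x, y = int(round(xy[0])), int(round(xy[1]))
    patch = image[y:y + h, x:x + w].astype(float)
    return float(np.sum((patch - reference.astype(float)) ** 2))

# Hypothetical usage: a micro-saccade as RLO minimization of the 2-variable function V(x, y)
# xy_opt, v = rlo_minimize(lambda xy: ssd_objective(xy, face_image, ref_patch),
#                          p0=np.asarray(initial_fixation, dtype=float))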
4 Conclusions

In this chapter, a novel and efficient optimization algorithm was presented for problems in which the objective functions are unknown. The search particle is able to build a model of successful actions and choose its future actions based on past exploring experience. The decisions on the step sizes (and directions) are made based on a trade-off between exploitation of the known search path and exploration for an improved search direction. In this sense, the algorithm falls into the category of reinforcement learning based optimization (RLO) methods. The algorithm does not require any prior knowledge of the objective function, nor does it impose any special requirements on such a function. It is conceptually very simple and intuitive, as well as easy to implement and tune.

The optimization algorithm was tested and verified using several multi-variable functions and compared with several other widely used random search optimization algorithms. Furthermore, the training of a multi-layer perceptron (MLP), based on finding a set of optimized weights that accomplish the learning, was treated as an optimization problem, and the proposed RLO was used to find the weights of the MLP in the training problem on the Iris database. Finally, the algorithm was used in an image recognition process to find a familiar object with retina sampling and micro-saccades.

The performance of RLO depends to a certain degree on the values of several parameters that the algorithm uses. With certain preset parameters, the performance of RLO meets our requirements in several machine learning problems involved in our current research. In future research, a theoretical and systematic analysis of the effect of these parameters will be conducted. In addition, using a group of search particles and their cooperation and competition, a population-based RLO can be developed. With the help of the model approximation techniques and the trade-off between exploration and exploitation proposed in this work, the population-based RLO is expected to have even better performance.
References

1. G. Arfken, "Lagrange Multipliers," §17.6 in Mathematical Methods for Physicists, 3rd ed., Orlando, FL: Academic Press, pp. 945-950, 1985.
2. S. Belur, A random search method for the optimization of a function of n variables, MATLAB Central File Exchange, [Online] Available: http://www.mathworks.com/matlabcentral/fileexchange/loadFile.do?objectId=100.
3. B. Cassin, S. Solomon, Dictionary of Eye Terminology. Gainesville, Florida: Triad Publishing Company, 1990.
4. Detecting a Cell Using Image Segmentation. Image Processing Toolbox, The MathWorks. [Online] Available: http://www.mathworks.com/products/image/demos.html.
5. L. C. W. Dixon, G. P. Szego, "The optimization problem: An introduction," Towards Global Optimization II, New York: North Holland, 1978.
6. J. A. Nelder, R. Mead, "A simplex method for function minimization," The Computer Journal, vol. 7, pp. 308-313, 1965.
7. Facegen Modeller. Singular Inversions. [Online] Available: http://www.facegen.com/products.htm.
8. X. del Toro Garcia, F. Neri, G. L. Cascella, N. Salvatore, "A surrogate associated Hooke-Jeeves algorithm to optimize the control system of a PMSM drive," IEEE ISIE, July 2006, pp. 347-352.
9. D. L. G. Hill and P. Batchelor, "Registration methodology: concepts and algorithms," in Medical Image Registration, J. V. Hajnal, D. L. G. Hill, and D. J. Hawkes, Eds. Boca Raton, FL: CRC, 2001.
10. R. Hooke, T. A. Jeeves, "Direct search solution of numerical and statistical problems," Journal of the Association for Computing Machinery, vol. 8, pp. 212-229, 1961.
11. J. Kennedy and R. C. Eberhart, "Particle swarm optimization," in Proc. IEEE Int. Conf. Neural Networks, vol. 4, pp. 1942-1948, Perth, Australia, Dec. 1995.
12. H. Kim, S. Pang, H. Je, "Support vector machine ensemble with bagging," Proc. of 1st Int. Workshop on Pattern Recognition with Support Vector Machines, SVM'2002, Niagara Falls, Canada, August 2002.
13. S. Kirkpatrick, C. D. Gelatt, Jr., M. P. Vecchi, "Optimization by simulated annealing," Science, vol. 220, no. 4598, pp. 671-680, 1983.
14. A. Leontitsis, Hybrid Particle Swarm Optimization, MATLAB Central File Exchange, [Online] Available: http://www.mathworks.com/matlabcentral/fileexchange/loadFile.do?objectId=6497.
15. R. M. Lewis, V. Torczon, and M. W. Trosset, "Direct search methods: Then and now," Journal of Computational and Applied Mathematics, vol. 124, no. 1, pp. 191-207, 2000.
16. Y. Li, Active Vision through Invariant Representations and Saccade Movements, Master's thesis, School of Electrical Engineering and Computer Science, Ohio University, 2006.
17. Y. Liu, J. A. Starzyk, Z. Zhu, "Optimized approximation algorithm in neural networks without overfitting," IEEE Trans. on Neural Networks, vol. 19, no. 4, pp. 983-995, June 2008.
18. I. J. Lustig, R. E. Marsten, D. F. Shanno, "Computational experience with a primal-dual interior point method for linear programming," Linear Algebra and its Applications, vol. 152, pp. 191-222, 1991.
19. S. Martinez-Conde, S. L. Macknik, D. H. Hubel, "The role of fixational eye movements in visual perception," Nature Reviews Neuroscience, vol. 5, no. 3, pp. 229-240, 2004.
20. Yew-Soon Ong, "Max-min surrogate-assisted evolutionary algorithm for robust design," IEEE Trans. on Evolutionary Computation, vol. 10, no. 4, pp. 392-404, August 2006.
21. M. J. D. Powell, "An efficient method for finding the minimum of a function of several variables without calculating derivatives," The Computer Journal, vol. 7, pp. 155-162, 1964.
22. R. A. Fisher, Iris Plants Database, [Online] Available: http://faculty.cs.byu.edu/˜cgc/Teaching/CS 478/iris.arff, July 1988.
23. H. H. Rosenbrock, "An automatic method for finding the greatest or least value of a function," The Computer Journal, vol. 3, pp. 175-184, 1960.
24. B. V. Sheela, "An optimized step-size random search," Computer Methods in Applied Mechanics and Engineering, vol. 19, no. 1, pp. 99-106, 1979.
25. J. A. Snyman, Practical Mathematical Optimization: An Introduction to Basic Optimization Theory and Classical and New Gradient-Based Algorithms, Springer Publishing, 2005.
26. J. A. Starzyk, "Motivation in Embodied Intelligence," in Frontiers in Robotics, Automation and Control, I-Tech Education and Publishing, Oct. 2008, pp. 83-110. [Online] Available: http://www.intechweb.org/book.php?%20id=78&content=subject&sid=11
27. R. S. Sutton and A. G. Barto, Reinforcement Learning: An Introduction, MIT Press, Cambridge, MA, 1998.
28. V. Torczon, "On the convergence of pattern search algorithms," SIAM Journal on Optimization, vol. 7, no. 1, pp. 1-25, 1997.
29. J. Vandekerckhove, General simulated annealing algorithm, MATLAB Central File Exchange, [Online] Available: http://www.mathworks.com/matlabcentral/fileexchange/loadFile.do?objectId=10548.
30. T. J. Ypma, "Historical development of the Newton-Raphson method," SIAM Review, vol. 37, no. 4, pp. 531-551, 1995.