Parameter Optimization of Extreme Learning Machine Using Bacterial Foraging Algorithm
Jae-Hoon Cho and Myung-Geun Chun
Dept. of Electrical and Computer Engineering, Chungbuk National University, Cheong-Ju, Korea
Email: [email protected], [email protected]
Dae-Jong Lee
CBNU BK21 Chungbuk Information Technology Center, Chungbuk National University, Cheong-Ju, Korea
Email: [email protected]

Abstract—Recently, the extreme learning machine (ELM), a novel learning algorithm that is much faster than traditional gradient-based learning algorithms, was proposed for single-hidden-layer feedforward neural networks (SLFNs). Usually, the initial input weights and hidden biases of ELM are chosen randomly, and the output weights are then determined analytically using the Moore-Penrose (MP) generalized inverse. However, ELM may need a higher number of hidden neurons because of the random determination of the input weights and hidden biases. In this paper, an optimization method based on the bacterial foraging (BF) algorithm is proposed to adjust the input weights and hidden biases. Experimental results show that this method achieves better performance than other methods on higher-dimensional problems.
Index Terms—Single-hidden-layer feedforward neural networks (SLFNs), Extreme Learning Machine, Bacterial Foraging Algorithm

I. INTRODUCTION
Neural networks were inspired by the neurons of the biological brain and provide a method for modeling patterns in data. They have been applied to various fields such as prediction and classification [1]. Neural networks are often classified into supervised and unsupervised learning according to the learning method. Among them, multi-layer neural networks have been widely used in pattern recognition and regression [2-4]. The multi-layer neural network proposed by David Rumelhart is trained by backpropagation based on a gradient-based learning rule [5]. Up to now, gradient-based learning methods have been widely applied to the training of multi-layer neural networks [6][7]. However, they have several shortcomings, such as the difficulty of setting learning parameters, slow convergence, training failures due to local minima, and the repetitive learning needed to improve the performance of multi-layer neural networks. It is also clear that gradient descent-based learning methods are generally very slow, since many iterative learning steps are required by such learning algorithms to obtain better learning performance.
(This work has been supported by EESRI (R-2007-2-046), which is funded by MOCIE, the Ministry of Commerce, Industry and Energy.)


To solve these problems, Huang et al. proposed a learning method called the extreme learning machine (ELM), which easily achieves good generalization performance at an extremely fast learning speed [8][9]. In ELM, the input weights and the hidden layer biases are chosen randomly, and the output weights (linking the hidden layer to the output layer) are determined analytically using the Moore-Penrose (MP) generalized inverse. ELM therefore not only learns much faster and with higher generalization performance than traditional gradient-based learning algorithms, but also avoids many of the difficulties faced by gradient-based learning methods, such as stopping criteria, learning rate, learning epochs, and local minima. However, ELM usually needs a higher number of hidden neurons because of the random determination of the input weights and hidden biases. To address this, a method taking advantage of both ELM and differential evolution (DE) was proposed by Zhu et al. [10]. On the other hand, the bacterial foraging (BF) algorithm, which mimics the food-searching behavior of bacteria, has been applied in the field of optimization. Optimization methods inspired by bacteria include the chemotaxis algorithm and the bacterial foraging algorithm. The chemotaxis algorithm was pioneered by Bremermann and his colleagues by analogy to the way bacteria react to chemoattractants in concentration gradients. The bacterial foraging algorithm by Passino is based on bacterial chemotaxis, reproduction, and elimination-dispersal events [11,12]. Here, we adopt the BF algorithm to search for the optimal input weights and hidden biases. In the proposed method, after the initial positions of the bacteria are chosen randomly, each bacterium tries to find optimal input weights and hidden biases. We show that this method overcomes the above-mentioned problems of ELM and achieves better performance.
This paper is organized as follows. In Section II, the ELM algorithm is introduced. In Section III, the bacterial foraging algorithm is briefly described. In Section IV, experimental results of the proposed method and a discussion are presented. Finally, conclusions are given in Section V.

II. EXTREME LEARNING MACHINE ALGORITHM
ELM was proposed by Huang et al. [8]. In multi-layer neural networks trained by gradient descent, all parameters need to be learned, and many iterative learning steps are usually required to obtain good performance; such methods are therefore apt to be slow or to converge to local minima. In ELM, the output weights are instead computed analytically using the MP generalized inverse rather than an iterative learning scheme. Fig. 1 shows the structure and learning procedure of ELM. As shown in Fig. 1, the ELM is a single-hidden-layer feedforward network (SLFN). The significant features of ELM can be summarized as follows:
- The learning speed of ELM is extremely fast. It can train SLFNs much faster than classical learning methods.
- ELM tends to reach not only the smallest training error but also the smallest norm of weights, and thus tends to generalize well.
- ELM can train SLFNs with non-differentiable activation functions.
- ELM reaches its solution directly, without issues such as stopping criteria, learning rate, or learning epochs.

Assume we are training an SLFN with $\tilde{N}$ hidden neurons to learn $N$ distinct samples $(\mathbf{x}_i, \mathbf{t}_i)$, where $\mathbf{x}_i = [x_{i1}, x_{i2}, \ldots, x_{in}]^T \in \mathbb{R}^n$ and $\mathbf{t}_i = [t_{i1}, t_{i2}, \ldots, t_{im}]^T \in \mathbb{R}^m$. An SLFN with $\tilde{N}$ hidden neurons and activation function $g(x)$ is mathematically modeled as

$$\sum_{i=1}^{\tilde{N}} \mathbf{w}_i\, g(\mathbf{v}_i \cdot \mathbf{x}_j + b_i) = \mathbf{o}_j, \qquad j = 1, \ldots, N \qquad (1)$$

where $\mathbf{v}_i = [v_{i1}, v_{i2}, \ldots, v_{in}]^T$ is the weight vector connecting the $i$th hidden neuron and the input neurons, $\mathbf{w}_i = [w_{i1}, w_{i2}, \ldots, w_{im}]^T$ is the weight vector connecting the $i$th hidden neuron and the output neurons, and $b_i$ is the bias of the $i$th hidden neuron. $\mathbf{v}_i \cdot \mathbf{x}_j$ denotes the inner product of $\mathbf{v}_i$ and $\mathbf{x}_j$. The output neurons are chosen to be linear. That a standard SLFN with $\tilde{N}$ hidden neurons and activation function $g(x)$ can approximate these $N$ samples with zero error means that $\sum_{j=1}^{N} \lVert \mathbf{o}_j - \mathbf{t}_j \rVert = 0$, i.e., there exist $\mathbf{w}_i$, $\mathbf{v}_i$, and $b_i$ such that

$$\sum_{i=1}^{\tilde{N}} \mathbf{w}_i\, g(\mathbf{v}_i \cdot \mathbf{x}_j + b_i) = \mathbf{t}_j, \qquad j = 1, \ldots, N \qquad (2)$$

The above $N$ equations can be written compactly as

$$\mathbf{H}\mathbf{w} = \mathbf{T} \qquad (3)$$

where

$$\mathbf{H}(\mathbf{v}_1, \ldots, \mathbf{v}_{\tilde{N}}, b_1, \ldots, b_{\tilde{N}}, \mathbf{x}_1, \ldots, \mathbf{x}_N) =
\begin{bmatrix}
g(\mathbf{v}_1 \cdot \mathbf{x}_1 + b_1) & \cdots & g(\mathbf{v}_{\tilde{N}} \cdot \mathbf{x}_1 + b_{\tilde{N}}) \\
\vdots & \ddots & \vdots \\
g(\mathbf{v}_1 \cdot \mathbf{x}_N + b_1) & \cdots & g(\mathbf{v}_{\tilde{N}} \cdot \mathbf{x}_N + b_{\tilde{N}})
\end{bmatrix}_{N \times \tilde{N}} \qquad (4)$$

$$\mathbf{w} = \begin{bmatrix} \mathbf{w}_1^T \\ \vdots \\ \mathbf{w}_{\tilde{N}}^T \end{bmatrix}_{\tilde{N} \times m}, \qquad
\mathbf{T} = \begin{bmatrix} \mathbf{t}_1^T \\ \vdots \\ \mathbf{t}_N^T \end{bmatrix}_{N \times m}$$

$\mathbf{H}$ is the hidden layer output matrix of the neural network; the $i$th column of $\mathbf{H}$ is the $i$th hidden neuron's output vector with respect to the inputs $\mathbf{x}_1, \mathbf{x}_2, \ldots, \mathbf{x}_N$.

Fig. 1. Structure and learning process of ELM: (a) the structure of ELM, (b) the learning process of ELM.
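The procedure above can be summarized in a short sketch. The following NumPy code is a minimal illustration (function names such as elm_train are ours, not the authors' implementation): it draws random input weights and biases, builds the hidden-layer output matrix H, and solves Hw = T with the Moore-Penrose pseudoinverse.

```python
import numpy as np

def elm_train(X, T, n_hidden, rng=None):
    """Basic ELM training: random input weights/biases, analytic output weights.
    X: (N, n) input samples, T: (N, m) targets."""
    rng = np.random.default_rng(0) if rng is None else rng
    n_inputs = X.shape[1]
    V = rng.uniform(-1.0, 1.0, size=(n_hidden, n_inputs))  # input weights v_i
    b = rng.uniform(-1.0, 1.0, size=n_hidden)               # hidden biases b_i
    H = 1.0 / (1.0 + np.exp(-(X @ V.T + b)))                # sigmoid g(v_i . x_j + b_i)
    W = np.linalg.pinv(H) @ T                               # output weights: w = H^+ T
    return V, b, W

def elm_predict(X, V, b, W):
    H = 1.0 / (1.0 + np.exp(-(X @ V.T + b)))
    return H @ W

if __name__ == "__main__":
    # Toy regression example: learn the sum of eight inputs.
    X = np.random.rand(200, 8)
    T = np.sum(X, axis=1, keepdims=True)
    V, b, W = elm_train(X, T, n_hidden=20)
    print("training RMSE:", np.sqrt(np.mean((elm_predict(X, V, b, W) - T) ** 2)))
```

Note that only the output weights W are fitted to the data; the quality of the random V and b is exactly what the BF optimization in the following sections is meant to improve.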

III. BACTERIAL FORAGING OPTIMIZATION
The search and optimal foraging behavior of animals can be used to solve engineering problems. To perform social foraging, an animal needs communication capabilities, and it gains several advantages: the group can exploit the sensing capabilities of all its members, gang up on larger prey, protect individuals from predators, and, in a certain sense, forage with a kind of collective intelligence [13,14].

A. Overview of Chemotactic Behavior of E. coli
This paper considers the foraging behavior of E. coli, a common type of bacterium [12]. Its behavior and movement come from a set of six rigid flagella spinning at 100-200 r.p.s., each driven by a biological motor. An E. coli bacterium alternates between running and tumbling. Its running speed is 10-20 μm/sec, but it is unable to swim straight. We model the chemotactic actions of the bacteria as follows:
- In a neutral medium, if it tumbles and runs in an alternating fashion, its action resembles a search.
- If it is swimming up a nutrient gradient (or out of noxious substances), or swimming for a longer period of time (climbing up a nutrient gradient or down a noxious gradient), its behavior seeks increasingly favorable environments.
- If it is swimming down a nutrient gradient (or up a noxious-substance gradient), the search action avoids unfavorable environments.
In this way it can climb up nutrient hills while avoiding noxious substances. The sensors it needs for optimal resolution are receptor proteins that are very sensitive and possess high gain: a small change in the concentration of nutrients can cause a significant change in behavior. This is probably the best-understood sensory and decision-making system in biology. Mutations in E. coli affect the reproductive efficiency at different temperatures and occur at a rate of about 10^-7 per gene per generation. E. coli also occasionally engages in conjugation, which affects the characteristics of a population of bacteria. There are many types of taxis used by bacteria, such as aerotaxis (attraction to oxygen), phototaxis (light), thermotaxis (temperature), and magnetotaxis (magnetic lines of flux), and some bacteria can change their shape and number of flagella, depending on the medium, to ensure efficient foraging in a variety of media. Bacteria can form intricate, stable spatio-temporal patterns in certain semisolid nutrient substances and can radially eat their way through a medium if initially placed together at its center. Moreover, under certain conditions they secrete cell-to-cell attractant signals in order to group together and protect each other.

B. Optimization Function for the BF Algorithm
The main goal of the BF-based algorithm is to find the minimum of $P(\phi)$, $\phi \in \mathbb{R}^n$, without using the gradient $\nabla P(\phi)$ [12]. Here $\phi$ is the position of a bacterium and $P(\phi)$ is an attractant-repellent profile: $P < 0$, $P = 0$, and $P > 0$ represent the presence of nutrients, a neutral medium, and the presence of noxious substances, respectively. The population of the $N$ bacteria at the $j$th chemotactic step, $k$th reproduction step, and $l$th elimination-dispersal event is

$$H(j, k, l) = \{\phi^{x}(j, k, l) \mid x = 1, 2, \ldots, N\} \qquad (5)$$

Let $P(x, j, k, l)$ denote the cost at the location of the $x$th bacterium $\phi^{x}(j, k, l) \in \mathbb{R}^n$, and let

$$\phi^{x}(j+1, k, l) = \phi^{x}(j, k, l) + C(x)\,\varphi(j) \qquad (6)$$

where $C(x) > 0$ is the size of the step taken in the random direction $\varphi(j)$ specified by the tumble. If the cost $P(x, j+1, k, l)$ at $\phi^{x}(j+1, k, l)$ is lower than the cost at $\phi^{x}(j, k, l)$, another chemotactic step of size $C(x)$ is taken in the same direction, and this is repeated up to a maximum number of steps $N_s$, the length of the lifetime of the bacteria measured in chemotactic steps. Cell-to-cell signaling via an attractant and a repellent is modeled by

$$P_c(\phi) = \sum_{i=1}^{N} P_{cc}^{i}(\phi)
= \sum_{i=1}^{N}\left[-L_{attract}\exp\!\left(-\delta_{attract}\sum_{j=1}^{p}(\phi_j-\phi_j^{i})^{2}\right)\right]
+ \sum_{i=1}^{N}\left[K_{repellant}\exp\!\left(-\delta_{repellant}\sum_{j=1}^{p}(\phi_j-\phi_j^{i})^{2}\right)\right] \qquad (7)$$

where $\phi = [\phi_1, \ldots, \phi_p]^T$ is a point in the optimization domain, $L_{attract}$ is the depth of the attractant released by a cell, $\delta_{attract}$ is a measure of the width of the attractant signal, $K_{repellant} = L_{attract}$ is the height of the repellent effect, and $\delta_{repellant}$ is a measure of the width of the repellent. The expression of $P_c(\phi)$ does not depend on the nutrient concentration at position $\phi$, whereas a bacterium with a high nutrient concentration actually secretes a stronger attractant than one with a low concentration. The model therefore uses the function $P_{ar}(\phi)$ to represent environment-dependent cell-to-cell signaling:

$$P_{ar}(\phi) = \exp\!\big(T - P(\phi)\big)\, P_c(\phi) \qquad (8)$$

where $T$ is a tunable parameter. By minimizing $P(x, j, k, l) + P_{ar}(\phi^{x}(j, k, l))$, the cells try to find nutrients and avoid noxious substances while moving toward other cells, but not too close to them. The function $P_{ar}(\phi)$ implies that, with $T$ held constant, the smaller $P(\phi)$ is, the larger $P_{ar}(\phi)$ becomes and thus the stronger the attraction, which is intuitively reasonable. When $T$ is very large, $P_{ar}(\phi)$ is much larger than $P(\phi)$, and the profile of the search space is dominated by the chemical attractant secreted by the cells. On the other hand, if $T$ is very small, $P_{ar}(\phi)$ is much smaller than $P(\phi)$ and the effect of the nutrients dominates. In $P_{ar}(\phi)$, the scaling factor of $P_c(\phi)$ is given in exponential form.
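As a concrete reading of Eqs. (7) and (8), the following sketch evaluates the swarming term P_c(φ) over the current bacterial positions and the environment-dependent term P_ar(φ). The parameter defaults are illustrative values, not taken from the paper, and the function names are ours.

```python
import numpy as np

def cell_to_cell(phi, positions, L_attract=0.1, d_attract=0.2,
                 K_repellant=0.1, d_repellant=10.0):
    """Swarming term P_c(phi) of Eq. (7): attraction plus repulsion summed
    over all bacteria, where `positions` holds one bacterium per row."""
    sq_dist = np.sum((phi - positions) ** 2, axis=1)      # sum_j (phi_j - phi_j^i)^2
    attract = -L_attract * np.exp(-d_attract * sq_dist)
    repel = K_repellant * np.exp(-d_repellant * sq_dist)
    return np.sum(attract + repel)

def environment_signal(phi, positions, cost, T=1.0, **kw):
    """Environment-dependent signaling P_ar(phi) of Eq. (8): exp(T - P(phi))
    scales the swarming term by the local nutrient level."""
    return np.exp(T - cost(phi)) * cell_to_cell(phi, positions, **kw)
```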

Fig. 2. Structure of the initial population in BF (a matrix whose rows hold the input weights w and hidden biases b of the SLFN).

Fig. 2 shows the structure of the initial population in the proposed model, and Fig. 3 is the flowchart of the bacterial foraging algorithm. The main purpose of this paper is to propose an advanced bacterial foraging method for finding the optimal input weights and hidden biases of ELM. The performance of the ELM algorithm varies with the selection of the initial input weights, and it is difficult to apply the general BF algorithm directly to ELM because of the problems of ELM explained above and the high-dimensional, large data. We therefore modified the original bacterial foraging algorithm. The optimization procedure of the bacterial foraging algorithm is briefly described as follows (a code sketch is given after the steps):

[step 1] Initialize the parameters n, N, N_c, N_re, N_ed, P_ed, and C(i), i = 1, 2, ..., N, where
  n: dimension of the search space
  N: number of bacteria in the population
  N_c: number of chemotactic steps
  N_re: number of reproduction steps
  N_ed: number of elimination-dispersal events
  P_ed: elimination-dispersal probability
  C(i): size of the step taken in the random direction specified by the tumble.
In our experiments, each bacterium represents a set of input weights and hidden biases: n is the number of input weights and hidden biases, N is the number of such weight-and-bias matrices, and C(i) is the amount by which the input weights and hidden biases vary during the chemotactic steps.
[step 2] Elimination-dispersal loop: l = l + 1.
[step 3] Reproduction loop: k = k + 1.
[step 4] Chemotaxis loop: j = j + 1.
  [step 4-1] For i = 1, 2, ..., N, take a chemotactic step for bacterium i as follows.
  [step 4-2] Compute the fitness function

  $$F(i, j, k, l) = \frac{1}{1 + MSE_{trn}} \qquad (9)$$

  where MSE_trn is the mean square error on the training data.
  [step 4-3] Let F_last = F(i, j, k, l) to save this value, since we may find a better cost via a run.
  [step 4-4] Tumble: generate a random vector Δ(i) ∈ R^n with each element Δ_m(i), m = 1, 2, ..., p, a random number on [-1, 1].
  [step 4-5] Move: let

  $$\phi^{i}(j+1, k, l) = \phi^{i}(j, k, l) + C(i)\, \frac{\Delta(i)}{\sqrt{\Delta^{T}(i)\,\Delta(i)}} \qquad (10)$$

  This results in a step of size C(i) in the direction of the tumble for bacterium i.
  [step 4-6] Compute F(i, j+1, k, l).
[step 5] If j < N_c, go to [step 4]; in this case continue chemotaxis, since the life of the bacteria is not over.
[step 6] Reproduction.
[step 7] If k < N_re, go to [step 3]; in this case we have not reached the specified number of reproduction steps, so we start the next generation in the chemotactic loop.
[step 8] Elimination-dispersal: for i = 1, 2, ..., N, eliminate and disperse each bacterium with probability P_ed; this keeps the number of bacteria in the population constant, because an eliminated bacterium is simply dispersed to a random location in the optimization domain. If l < N_ed, go to [step 2]; otherwise end.

In the above process, after the fitness values of all bacteria have been computed in the chemotactic step, the elimination-dispersal step and the reproduction step are applied. During the chemotactic step each bacterium tries to find optimal input weights and hidden biases, and the bacteria with better fitness values are retained for the next chemotactic step.

Fig. 3. Flowchart of the bacterial foraging algorithm.
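The code sketch referred to above is given here. It is a minimal, self-contained reading of steps 1-8 (not the authors' code): each bacterium is a flat vector of input weights and hidden biases, the fitness is F = 1/(1 + MSE_trn) as in Eq. (9), and reproduction and elimination-dispersal are reduced to their simplest forms. The swim length Ns and the reseeding rule are our assumptions.

```python
import numpy as np

def bf_optimize(fitness, dim, N=1, Nc=8, Nre=1, Ned=1, Ped=0.5, C=0.01, Ns=4,
                rng=None):
    """Minimal bacterial-foraging search: tumble, swim, reproduce, disperse.
    `fitness` maps a position vector (input weights + biases) to F = 1/(1+MSE)."""
    rng = np.random.default_rng(0) if rng is None else rng
    pop = rng.uniform(-1.0, 1.0, size=(N, dim))
    best, best_fit = pop[0].copy(), fitness(pop[0])
    for _ in range(Ned):                                  # elimination-dispersal events
        for _ in range(Nre):                              # reproduction steps
            for _ in range(Nc):                           # chemotactic steps
                for i in range(N):
                    F_last = fitness(pop[i])
                    delta = rng.uniform(-1.0, 1.0, dim)   # tumble direction, Eq. in step 4-4
                    step = C * delta / np.linalg.norm(delta)   # normalized move, Eq. (10)
                    for _ in range(Ns):                   # swim while fitness improves
                        cand = pop[i] + step
                        F_new = fitness(cand)
                        if F_new > F_last:                # higher fitness = lower MSE
                            pop[i], F_last = cand, F_new
                        else:
                            break
                    if F_last > best_fit:
                        best, best_fit = pop[i].copy(), F_last
            # Reproduction (simplified here): reseed the population around the best position.
            pop = best + 0.01 * rng.standard_normal((N, dim))
        # Elimination-dispersal: with probability Ped, scatter a bacterium at random.
        for i in range(N):
            if rng.random() < Ped:
                pop[i] = rng.uniform(-1.0, 1.0, dim)
    return best, best_fit
```

In the ELM setting, `fitness` would reshape the position vector into the input-weight matrix and bias vector, compute the output weights analytically as in the Section II sketch, and return 1/(1 + MSE_trn) on the training data.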

TABLE 1. INITIAL PARAMETERS OF THE BACTERIAL FORAGING ALGORITHM

  N (number of bacteria in the population): 1
  N_c (number of chemotactic steps): 8
  N_re (number of reproduction steps): 1
  N_ed (number of elimination-dispersal events): 1
  P_ed (elimination-dispersal probability): 0.5
  C (step size taken in the random direction specified by the tumble): 0.01
  Ñ (number of hidden neurons): 20
  Activation function: F(x) = 1/(1 + exp(-x))

IV. EXPERIMENTAL RESULTS

A. Function Approximation Problem
To demonstrate its performance, the proposed method was applied to the California Housing dataset [8]. The dataset contains 20,640 observations for predicting the price of houses in California; information on the variables was collected from all the block groups in California in the 1990 Census. A block group on average includes 1,425.5 individuals living in a geographically compact area. The final data contain 20,640 observations on nine variables: eight continuous inputs (median income, housing median age, total rooms, total bedrooms, population, households, latitude, and longitude) and one continuous output (median house value). In our simulations, 8,000 training samples and 12,640 testing samples were randomly drawn from the California Housing database for each trial, and the eight input attributes and the output were normalized to the range [0, 1]. The initial parameters of the proposed method are shown in Table 1. Because the population size of the advanced bacterial foraging algorithm is determined by the number of hidden neurons and input attributes, each bacterium is a 40-by-8 matrix serving as the initial position and direction vector, as shown in Fig. 4; here w and b indicate the input weights and hidden biases, respectively. Fifty trials were conducted for all the algorithms and the average results are shown in Table 2. As seen from Table 2, the learning speed of the proposed method is faster than that of SVM and BP, and its generalization performance is better than that of SVM, BP, and ELM.
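The data preparation described above amounts to min-max scaling and a random 8,000/12,640 split. A small sketch follows; the array layout (eight input columns plus one target column) and function name are assumptions for illustration.

```python
import numpy as np

def prepare_california(data, n_train=8000, rng=None):
    """Normalize all attributes to [0, 1] and split into training/testing sets.
    `data` is a (20640, 9) array: eight inputs followed by the median house value."""
    rng = np.random.default_rng(0) if rng is None else rng
    lo, hi = data.min(axis=0), data.max(axis=0)
    scaled = (data - lo) / (hi - lo)                      # min-max normalization to [0, 1]
    idx = rng.permutation(len(scaled))                    # random split per trial
    train, test = scaled[idx[:n_train]], scaled[idx[n_train:]]
    return (train[:, :8], train[:, 8:]), (test[:, :8], test[:, 8:])
```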

Fig. 4. The proposed bacterial foraging algorithm.

TABLE 2. PERFORMANCE COMPARISON USING CALIFORNIA HOUSING DATA

  Method            Training time (s)   Testing time (s)   Training RMS   Testing RMS   No. of SVs/Neurons
  ELM [8]           0.27                0.14               0.1358         0.1365        20
  BP [8]            295.23              0.28               0.1369         0.1426        20
  SVM [8]           558.41              20.97              0.1267         0.1275        2534
  Proposed method   30.85               -                  0.0173         0.0267        20

Fig. 5. Result of the proposed method: (a) training data, (b) testing data.

B. Real Medical Diagnosis Application
The proposed method has also been compared with many other popular algorithms on a real medical diagnosis problem using the Pima Indians Diabetes Database, produced at the Applied Physics Laboratory, Johns Hopkins University, in 1988 [8]. The database consists of 768 women over the age of 21, residents of Phoenix, Arizona, and every example belongs to either the positive or the negative class. All the input values are within [0, 1]. For this problem, 75% and 25% of the samples are randomly chosen for training and testing at each trial, respectively. Fifty trials were conducted for all the algorithms and the average results are shown in Table 3. As seen from Table 3, the proposed method obtains a lower training success rate than BP, but it obtains better generalization (testing) performance than BP and SVM.

TABLE 3. PERFORMANCE COMPARISON USING PIMA INDIANS DIABETES DATA

  Method            Training time (s)   Success rate, training (%)   Success rate, testing (%)   No. of SVs/Neurons
  ELM [8]           0.015               78.71                         76.54                        20
  BP [8]            16.196              92.86                         63.45                        20
  SVM [8]           0.1860              78.76                         77.31                        317.16
  Proposed method   2.57                79.69                         79.17                        20

V. CONCLUSIONS
In this paper, an advanced method using the bacterial foraging algorithm to adjust the input weights and hidden biases of ELM was proposed. In the proposed method, after the initial positions of the bacteria are chosen randomly, the bacteria try to find the optimal input weights and hidden biases through the chemotactic, reproduction, and elimination-dispersal steps. The proposed method was applied to the California Housing dataset and the Pima Indians Diabetes Database, and it showed better generalization performance than the previous methods.

REFERENCES

[1] R. O. Duda, P. E. Hart, and D. G. Stork, Pattern Classification, 2nd ed., John Wiley & Sons, 2001.
[2] J.-S. R. Jang, C.-T. Sun, and E. Mizutani, Neuro-Fuzzy and Soft Computing: A Computational Approach to Learning and Machine Intelligence, Prentice Hall, 1997.
[3] J.-S. R. Jang, "ANFIS: Adaptive Network-based Fuzzy Inference System," IEEE Trans. on Systems, Man, and Cybernetics, Vol. 23, No. 3, pp. 665-685, 1993.
[4] M. F. Azeem, M. Hanmandlu, and N. Ahmad, "Structure Identification of Generalized Adaptive Neuro-Fuzzy Inference Systems," IEEE Trans. on Fuzzy Systems, Vol. 11, No. 5, pp. 666-681, 2003.
[5] D. Rumelhart and J. McClelland, Parallel Distributed Processing, Vol. 1, MIT Press, 1986.
[6] S. Suresh, S. N. Omkar, and V. Mani, "Parallel Implementation of Back-propagation Algorithm in Networks of Workstations," IEEE Trans. on Parallel and Distributed Systems, Vol. 16, pp. 24-34, 2005.
[7] C.-T. Hsu, M.-S. Kang, and C.-S. Chen, "Design of Adaptive Load Shedding by Artificial Neural Networks," IEE Proc., Generation, Transmission and Distribution, Vol. 152, pp. 415-421, 2005.
[8] G.-B. Huang, Q.-Y. Zhu, and C.-K. Siew, "Extreme Learning Machine: A New Learning Scheme of Feedforward Neural Networks," International Joint Conference on Neural Networks (IJCNN 2004), pp. 25-29, 2004.
[9] M.-B. Li, G.-B. Huang, P. Saratchandran, and N. Sundararajan, "Fully Complex Extreme Learning Machine," Neurocomputing, Vol. 68, pp. 306-314, 2005.
[10] Q.-Y. Zhu, A. K. Qin, P. N. Suganthan, and G.-B. Huang, "Evolutionary Extreme Learning Machine," Pattern Recognition, Vol. 38, pp. 1759-1763, 2005.
[11] H. J. Bremermann, "Chemotaxis and Optimization," J. Franklin Inst., Vol. 297, pp. 397-404, 1974.
[12] K. M. Passino, Biomimicry of Bacterial Foraging for Distributed Optimization, University Press, Princeton, New Jersey, 2001.
[13] W. J. O'Brien, H. I. Browman, and B. I. Evans, "Search Strategies of Foraging Animals," American Scientist, Vol. 78, pp. 152-160, 1990.
[14] D. W. Stephens and J. R. Krebs, Foraging Theory, Princeton University Press, Princeton, New Jersey, 1986.