Learning of a Single-hidden Layer Feedforward Neural Network using an Optimized Extreme Learning Machine Tiago Matiasa,b,∗, Francisco Souzaa,b , Rui Araújoa,b , Carlos Henggeler Antunesb,c a Institute
of Systems and Robotics (ISR-UC), University of Coimbra, Pólo II, PT-3030-290 Coimbra, Portugal of Electrical and Computer Engineering (DEEC-UC), University of Coimbra, Pólo II, PT-3030-290 Coimbra, Portugal c Institute for Systems Engineering and Computers (INESC), R. Antero de Quental, 199, PT-3000-033 Coimbra, Portugal
b Department
Abstract This paper proposes a learning framework for single-hidden layer feedforward neural networks (SLFN) called optimized extreme learning machine (O-ELM). In O-ELM, the structure and the parameters of the SLFN are determined using an optimization method. The output weights, like in the batch ELM, are obtained by a least squares algorithm, but using Tikhonov’s regularization in order to improve the SLFN performance in the presence of noisy data. The optimization method is used to select the set of input variables, the hidden-layer configuration and bias, the input weights and the Tikhonov’s regularization factor. The proposed framework has been tested with three optimization methods (genetic algorithms, simulated annealing, and differential evolution) over sixteen benchmark problems available in public repositories. Keywords: Optimized extreme learning machine, Single-hidden layer feedforward neural networks, Genetic algorithms, Simulated annealing, Differential evolution.
1. Introduction
sidering that the activation function of the output neuron is linear. Since in ELM the output weights are computed based on the random input weights and bias of the hidden nodes, there may exist a set of non-optimal or unnecessary input weights and bias of the hidden nodes. Furthermore, the ELM tends to require more hidden neurons than conventional tuning-based learning algorithms (based on error backpropagation or other learning methods where the output weights are not obtained by the least squares method) in some applications, which can negatively affect SLFN performance in unknown testing data [6]. The use of the least squares method without regularization in noisy data also makes the model displaying a poor generalization capability [7]. Fitting problems may also be encountered in the presence of irrelevant or correlated input variables [5]. Optimization methods have been used jointly with analytical methods for network training. In [8] a new method to choose the most appropriate FFNN topology, type of activation functions and parameters of the training algorithm using a genetic algorithm (GA) is proposed. Each chromosome is composed of the specification of the minimization algorithm used in the backpropagation (BP) method, the network architecture, the
Multilayer feedforward neural networks (FFNN) have been used in the identification of unknown linear or non-linear systems (see, e.g. [1], and [2]). Their appeal is based on their universal approximation properties [3, 4]. However, in industrial applications, linear models are often preferred due to faster training in comparison with multilayer FFNN trained with gradientdescent algorithms [5]. In order to overcome the slow construction of FFNN models, a new method called extreme learning machine (ELM) is proposed in [6]. This method is a new batch learning algorithm for singlehidden layer FFNN (SLFN) where the input weights (weights of connections between the input variables and neurons in the hidden-layer) and the bias of neurons in the hidden-layer are randomly assigned. The output weights (weights of connections between the neurons in the hidden-layer and the output neuron) are obtained using the Moore–Penrose (MP) generalized inverse, con∗ Corresponding
author Email addresses:
[email protected] (Tiago Matias),
[email protected] (Francisco Souza),
[email protected] (Rui Araújo),
[email protected] (Carlos Henggeler Antunes)
1
activation function of the neurons of the hidden layer, and the activation function of the neurons of the output layer using binary encoding. In [9] a new nonlinear system identification scheme is proposed, where differential evolution (DE) is used to optimize the initial weights used by a Levenberg Marquardt (LM) algorithm in the learning of a FFNN. In [10] a similar method is proposed using a simulated annealing (SA) approach. In these algorithms, the evaluation of each individual or state is made training the FFNN with a BP method, which is computationally expensive. In [11] an improved GA is used to optimize the structure (connections layout) and the parameters (connection weights and biases) of a SLFN with switches. The switches are unit step functions that make possible the removal of each connection. Using a real encoding scheme, and new crossover and mutation techniques, this improved GA obtains better results in comparison with traditional GAs. The structure and the parameters of the same kind of SLFN with switches are also tuned in [12], in this case using a hybrid Taguchi GA. This approach is similar to a traditional GA but a Taguchi method [13] is used for the crossover process. The use of this method implies the construction of a (n + 1) × n two-level orthogonal matrix, where n is the number of variables for the optimization process. However, the construction of this orthogonal matrix is not simple. There are some standard orthogonal matrices but they can be only used when n is small. In large networks, n is large and therefore this method is not a good practical approach. In these methodologies, the weights between the hidden-layer and the output layer are optimized by the GA. Using the ELM approach, the output weights could be calculated using the Moore-Penrose generalized inverse (considering an output neuron with linear activation function) and a good solution could be quickly obtained, reducing the convergence time of the GA. Furthermore, as the number of variables of the optimization process is lower, the search space to be explored by the GA narrows. This approach was used in [14] where a GA is used to tune the (selective existence of) connections and parameters between the input layer and the hidden layer, and a least squares algorithm is applied to tune the parameters between the hidden layer and the output layer. However, in this type of approach it is difficult to deal with the tendency to require more hidden nodes than conventional tuning-based algorithms, as well as the problem caused by the presence of irrelevant variables is difficult to solve. These problems occur also in the methods proposed in [15] and [16]. In [15] a learning method called evolutionary ELM (E-ELM) is proposed, where the weights between
the input layer and the hidden layer and the bias of the hidden layer neurons are optimized by a DE method and the output weights are calculated using the MoorePenrose generalized inverse like in ELM. In [16] a similar method called self-adaptive E-ELM (SaE-ELM) is proposed; however, in this methodology the generation strategies and control parameters of the DE method are self-adapted by the optimization method. In this paper, a novel learning framework for SLFNs called optimized extreme learning machine (O-ELM) is proposed. This framework uses the same concept of the ELM where the output weights are obtained using least squares, however, with the difference that Tikhonov’s regularization [17] is used in order to obtain a robust least squares solution. The problem of reduction of the ELM performance in the presence of irrelevant variables is well known, as well as its propensity for requiring more hidden nodes than conventional tuning-based learning algorithms. To solve these problems, the proposed framework uses an optimization method to select the set of input variables and the configuration of the hidden-layer. Furthermore, in order to optimize the fitting performance, the optimization method also selects the weights of connections between the input layer and the hidden-layer, the bias of neurons of the hidden-layer, and the regularization factor. Using this framework, no trial-and-error experiments are needed to search for the best SLFN structure. In this paper, three optimization methods (GA, SA, and DE) are tested in the proposed framework. The paper is organized as follows. The SLFN architecture is overviewed in Section 2. Section 3 gives a brief review of the batch ELM. The proposed learning framework is presented in Section 4. Section 5 gives a brief review of the optimization methods tested in the OELM. Section 6 presents experimental results. Finally, concluding remarks are drawn in Section 7. 2. Adjustable Single Hidden-Layer Feedforward Network Architecture The neural network considered in this paper is a SLFN with adjustable architecture as shown in Fig. 1, which can be mathematically represented by: h X y = g bO + w jO v j , (1) j=1 n X (2) v j = f j b j + wi j si xi . i=1
n and h are the number of input variables and the number of the hidden layer neurons, respectively; v j is the
2
Hidden Layer
Inputs
x1
s1
fj
wjO
g
wn1 bj
y
bO whO
wnj wnh
(3)
where y = [y(1), . . . , y(N)]T is the vector of outputs of the SLFN, wO = [w1O , . . . , whO ]T is the vector of output weights, and V is the matrix of the outputs of the hidden neurons (1) given by: v1 (1) v1 (2) . . . v1 (N) . .. .. .. , (4) V = .. . . . vh (1) vh (2) . . . vh (N)
w1O
b1
w1h
xn sn
function, (1) and (2) can be rewritten as: T y = wO T V ,
f1
w11 w1j
Output
fh
with si = 1, i = 1, . . . , n. Considering that the input W, b2 b1 w w 11 12 W = . .. . . . wn1 wn2
bh Figure 1: Single hidden-layer feedforward network with adjustable architecture.
weights and bias matrix ... ... .. .
bh w1h .. .
...
wnh
,
(5)
is randomly assigned, the output weights vector wO is estimated as: ˆ O = V† yd , w (6)
output of the hidden-layer neuron j; xi , i = 1, . . . , n, are the input variables; wi j is the weight of the connection between the input variable i and the neuron j of the hidden layer; w jO is the weight of the connection between neuron j of the hidden layer and the output neuron; b j is the bias of the hidden layer neuron j, j = 1, . . . , h, and bO is the bias of the output neuron; f j (·) and g(·) represent the activation function of the neuron j of the hidden layer and the activation function of the output neuron, respectively. si is a binary variable used in the selection of the input variables during the design of the SLFN. Using the binary variable si , i = 1, . . . , n, each input variable can be considered or not. However, the use of variables si is not the single tool to optimize the structure of the SLFN. The configuration of the hidden layer can be adjusted in order to minimize the output error of the model. The activation function f j (·), j = 1, . . . , h, of each hidden node can be either zero, if this neuron is unnecessary, or any (predefined) activation function.
where V† is the Moore-Penrose generalized inverse of the hidden-layer output matrix V, and yd = [yd (1), . . . , yd (N)]T is the desired output. Considering that V ∈ RN×h with N ≥ h and rank(V) = h, the Moore-Penrose generalized inverse of V can be given by: V† = (VT V)−1 VT .
(7)
Substituting (7) into (6), the estimation of wO can be obtained by the following least-squares solution: ˆ O = (VT V)−1 VT yd . w
(8)
4. Optimized Extreme Learning Machine In O-ELM, the weights of the output connections are obtained using the same ELM methodology presented in Section 3, however, with a change. The objective of the least squares method is to obtain the best output weights by solving the following problem: min(ky − yd k2 ), (9)
3. Extreme Learning Machine The batch ELM was proposed in [6]. In [18] it is proved that a SLFN with randomly chosen weights between the input layer and the hidden layer and adequately chosen output weights are universal approximators with any bounded non-linear piecewise continuous functions. Considering that N samples are available, the output bias is zero, and the output neuron has a linear activation
where || · ||2 is the Euclidean norm. The minimum-norm solution to this problem is given by (8). The use of least squares can be considered as a twostage minimization problem involving: the determination of the solutions to (9), and the determination of the 3
where sλj ∈ {0, 1, 2}, j = 1, . . . , h, is an integer variable that defines the activation function f j of each neuron j of the hidden-layer as follows: 0 if sλj = 0, 1/(1 + exp(−υ)) if sλj = 1, f j (υ) = (15) υ if sλj = 2.
solution with minimum norm among solutions obtained in the previous stage. The use of Tikhonov’s regularization [17] allows the transformation of this two-stage problem into a singlestage minimization problem defined by: min(ky − yd k2 + αkwO k2 ),
(10)
The use of parameters sλj makes it possible the adjustment of the number of neurons (if sλj = 0 the neuron is not considered), and the activation function of each neuron (sigmoid or linear function). In this work only these two types of activation function have been used; however, any type of activation function can be considered. This optimization problem is a problem where the decision variables are a combination of real, integer, and binary variables. So, in order to use any type of optimization method, the approach suggested in [20] was used. The decision variables are mapped into real variables within the interval [0,1] and before computing the evaluation function for each individual, all variables need to be converted into their true value. If the true value of the l-th variable (l = 1, 2, . . . , v) of individual k is real, it is given by:
where α > 0 is a regularization parameter. The solution to this problem is given by [17]: ˆ O = (VT V + αI)−1 VT yd , w
(11)
where I is the h × h identity matrix. If V is ill-conditioned, problem (10) should be preferred over problem (9) because the solution is numerically more stable [19]. Furthermore, using the Tikhonov’s regularization, the robustness of the least squares solution against noise is improved. As previously mentioned, the ELM tends to require more hidden nodes than conventional tuning-based algorithms. Furthermore, the presence of irrelevant variables in the training data set causes a decrease in the performance. To overcome these problems, in O-ELM the determination of the set of input variables, the number and activation function of the neurons in the hidden layer, the connection weights between the inputs and the neurons of the hidden layer, the bias of the hidden layer neurons, and the regularization parameter α is made using an optimization methodology. The optimization of the SLFN consists in minimizing the following evaluation function: ψ = Ermse (y, yd ),
min − zmin zkl = (zmax l l )pkl + zl ,
where zmin and zmax represent the true variable bounds l l min (zl ≤ zkl ≤ zmax ). If it is a integer value: l + 1)pkl + zmin − zmin zkl = rounddown (zmax l , l l
Ermse (y, yd ) =
N 1 X y(k) − yd (k) 2 N k=1
(13)
where round(·) is a function that rounds to the nearest integer. The variables si , i = 1, . . . , n, are binary variables and thus are converted using (18). The variables sλj , j = 1, . . . , h, are integer variables and thus are converted using (17), considering that the lower and upper bounds are 0 and 2, respectively. The input weights wi j and bias b j are converted using (16), considering that the lower and upper bounds are -1 and 1. Finally, the regularization parameter is also converted using (16), considering the lower and upper bounds are 0 and 100. The proposed framework can be jointly used with a wide range of optimization methods. In this paper three optimization methods (GA, DE, and SA) were
is the root mean square error (RMSE) between the desired (real) output and the estimated values of the output. To improve the generalization performance, the estimation error Ermse (y, yd ) is obtained in a validation data set that has no overlap with the training data set. In the optimization process, it is considered that the individual/state will be constituted by: pk
= [w11 , . . . , wnh , b1 , . . . , bh ,
k
s1 , . . . , sn , sλ1 , . . . , sλh , α]T ; = 1, . . . , m,
(17)
where rounddown(·) is a function that rounds to the greatest integer than is lower than or equal to its argument. If the true value is binary, it is given by: (18) zkl = round pkl ,
(12)
where v u t
(16)
(14) 4
tested, resulting in three learning approaches called genetically O-ELM (GO-ELM), O-ELM by differential evolution (DEO-ELM), and O-ELM by simulated annealing (SAO-ELM). These optimization methods were chosen due to the good performance revealed in several combinatorial optimization problems.
are mapped into continuous values between 0 and 1, before obtaining the fitness of each individual these values need to be converted into the actual variable values using (16)-(18). After this step, the fitness of each individual can be obtained using (12). 5.1.3. Genetic Operators Based upon their fitness values, a set of individuals is selected to survive to the next generation while the remaining are discarded. The surviving individuals become the mating pool and the discarded chromosomes are replaced by new offspring. In this work 40% of the individuals with best fitness survive for the next generation. To select the parents from the mating pool, tournament selection was used. For each parent, three individuals from the mating pool are randomly picked and the individual with best fitness is selected to be a parent. For each pair of parents two new individuals (offspring) are generated by crossover and mutation. The crossover operation consists in producing offspring from the selected parents. Uniform crossover is used because it generally provides a larger exploration of the search space than other crossover operators [25]. In uniform crossover, first a random binary mask with the same length of the individuals is created. Then, each offspring receives values of variables from the first or second parent depending on whether the value of the mask bit is zero or one: the offspring 1/(2), receives the values from parent 1/(2) if the respective mask bit is one and receives the values from parent 2/(1) if the respective mask bit is zero. Consider the following example: Parent 1 = p11 p12 p13 p14 , Parent 2 = p21 p22 p23 p24 , Mask = 1 0 0 1 Offspring 1 = p11 p22 p23 p14 , Offspring 2 = p21 p12 p13 p24 . After crossover, the mutation operator is used to maintain the diversity of the population and to prevent the algorithm from being trapped in local minima, thus enabling a thorough exploration of the search space. For each new offspring a random number r is generated and if r < rm , where rm is the mutation probability, this offspring is mutated. The mutation used is a two step operator. First, a random element of the individual is replaced by an uniform random value within the iT h interval [0,1]. Being pk = pk1 , pk2 , pk3 , pk4 the offspring, if the second element is selected to be replaced, the mutated chromosome is given by:
5. Optimization Methods This section presents the three methods used in the optimization of the SLFN. 5.1. Genetic Algorithms The basic principles of GAs were introduced by Holland [21]. GA are population-based search and optimization methods that have been used to solve complex problems effectively [22, 23, 24]. The principle of GA is the simulation of the natural processes of evolution applying the Darwin’s theory of natural selection. In a GA, solutions are encoded into chromosomes (individuals) and the fittest ones are more susceptible to be selected for reproduction, producing offspring with characteristics of both parents. The GA used in the optimization of the SLFN is based on the approach proposed in [20]. The detailed description of this method is given in this section. 5.1.1. Variable Encoding and Initial Population As previously mentioned, the optimization of the SLFN is a problem in which the cost function involves real, integer, and binary variables. So, in order to solve this optimization problem all variables are mapped into continuous variables within the interval [0,1]. The initial population P is randomly chosen with uniform distribution and can be represented by:
where
P = {p1 , p2 , . . . , pm },
(19)
iT h pk = pk1 , pk2 , . . . , pkq .
(20)
pkl is the variable l, l = 1, 2, . . . , q, of chromosome k, k = 1, 2, . . . , m, with 0 ≤ pkl ≤ 1. m and q are the population size and the number of variables to be tuned, respectively. 5.1.2. Evaluation Each chromosome represents one possible solution to the optimization problem. Therefore each chromosome can be evaluated using a fitness function that is specific to the problem being solved. As all variables 5
iT h p1k = pk1 , p′k2 , pk3 , pk4 ,
(21)
where p′k2 is a new random value within the interval [0,1]. In a second step, a random adjustment factor is added to the chromosome. The adjustment factor comes from multiplying each element l within the previously mutated chromosome p1k by a random number (−1 ≤ βkl ≤ 1) and multiplying the resulting chromosome by a mutation factor (0 ≤ ηk ≤ 1) so that: iT h pck = ηk βk1 pk1 , βk2 p′k2 , βk3 pk3 , βk4 pk4 .
Finally, the mutated chromosome is given by: p2k = rem p1k + pck ,
where U[0,1] is a uniformly distributed random number within the interval [0,1], C is the acceptance probability, and δk ∈ {1, . . . , m} is a random index for individual k. The individual uk is selected to substitute pk in the population P if its performance is not worse than the performance of pk . Like in the GA, before evaluating the performance of an individual its component variables need to be converted into the actual variable values in the application domain using (16)-(18). The fitness of each individual is obtained by the evaluation function (12).
(22)
5.3. Simulated Annealing Simulated annealing (SA) is a meta-heuristic proposed in [28] for global optimization problems. It works by emulating the physical process of annealing a solid to obtain a state where the energy configuration is minimal. iT h Consider the solution pk = pk1 , pk2 , . . . , pkq in iteration k. In the SA algorithm, a perturbation is applied to pk resulting in a neighbor solution vk . Considering a minimization problem, this neighbor will be then accepted as the new current solution according to the following acceptance probability C:
(23)
where rem is the remainder of each variable after the division by one. 5.2. Differential Evolution Differential evolution grew out of attempts to solve the Chebychev Polynomial fitting Problem. In 1995 the first publication about DE appeared in a technical report written by Storn and Price [26]. Due to the excellent performance in the 1996 and 1997 International Contest on Evolutionary Optimization in the IEEE International Conference on Evolutionary Computation, DE began to gain prominence in the scientific community [27]. DE is a population-based meta-heuristic approach where new candidate solutions are generated adding a weighted difference vector between two population members to a third member. Like in GA, in DE the initialization of the population is randomly performed with uniform distribution within the interval [0,1], yielding an initial population with m individuals (19). After the population initialization and while a stop criterion is not met, for each individual pk (k = 1, 2, . . . , m) of the population three random individuals (ρ, σ, and τ) are selected, with ρ , σ , τ. The difference vector between two of these individuals is calculated. This difference vector is multiplied by a scaling factor and is added to the third selected individual obtaining the trial vector p′k : p′k = pρ + F(pσ − pτ ),
1, C= ψ(vk ) − ψ(pk ) exp , T k
if ψ(vk ) ≤ ψ(pk ), otherwise,
(26) where ψ(vk ) and ψ(pk ) are the fitness values of vk and pk as given by (12) and T k ∈ R+ is the temperature in iteration k. Like in the other two optimization methods (GA and DE), before evaluating the performance of a state its component variables need to be converted into the actual variable values in the application domain using (16)-(18). In each iteration the temperature is decreased as T k+1 = γT k , where γ ∈ (0, 1) is a user-defined constant. 6. Results This section presents experimental results in 16 benchmark data sets. Table 1 presents the number of training samples Nt , the number of testing samples Ntt , and the number of input variables of these benchmark data sets. In these data sets, delays between each input variable and the output could be considered and the performance of the methods could possibly be improved [33]; however, as only the learning capability of the SLFN is being analyzed, no delay between the input variables and the output variable in all data sets has been considered.
(24)
where F is the scaling factor. In order to increase the diversity of the individuals, the following vector uk = iT h uk1 , uk2 , . . . , ukq is created, where each element l, l = 1, 2, . . . , q, is given by: ( ′ pkl , if U[0,1] ≤ C ∨ l = δk , (25) ukl = pkl , otherwise, 6
Table 1: Publicly available benchmark data sets. Dataset Boston Housing [29] Automobile MPG [29] Cancer [29] Servo [29] Price [29] CPU [29] Concrete Comp. Strength [29] Ailerons [30] Pumadyn [31] Elevators [30] Pole Telecommunication [30] Bank [31] Computer Activity [31] Stock [30] Communities and Crime [29] Pyrimidines [32]
Nt
Ntt
n
253 196 97 84 80 105 515 7154 4499 8752 5000 4500 4096 475 997 37
253 196 97 83 80 104 515 6596 3693 7847 10000 3692 4096 475 997 37
13 6 32 4 15 6 8 40 32 18 48 32 21 9 127 27
Table 2: Results of the application of the three optimization methods in the optimized extreme learning machine in the benchmark data sets using the RMSE as the performance measure.
Data set Boston Housing
Method
GO-ELM DEO-ELM SAO-ELM Automobile GO-ELM MPG DEO-ELM SAO-ELM Cancer GO-ELM DEO-ELM SAO-ELM Servo GO-ELM DEO-ELM SAO-ELM Price GO-ELM DEO-ELM SAO-ELM CPU GO-ELM DEO-ELM SAO-ELM Concrete GO-ELM Comp. DEO-ELM Strength SAO-ELM Ailerons GO-ELM DEO-ELM SAO-ELM Pumadyn GO-ELM DEO-ELM SAO-ELM Elevators GO-ELM DEO-ELM SAO-ELM Pole GO-ELM Telecommu- DEO-ELM nication SAO-ELM Bank GO-ELM DEO-ELM SAO-ELM Computer GO-ELM Activity DEO-ELM SAO-ELM Stock GO-ELM DEO-ELM SAO-ELM Communities GO-ELM and Crime DEO-ELM SAO-ELM Pyrimidines GO-ELM DEO-ELM SAO-ELM
The data was divided as follows: the training and testing sets available in the repositories were used; when these separate data sets are not provided, the first half of the data was used for training, and the second half was used for testing. All the input and output variables have been normalized to the range [-1,1]. All simulations have been made in Matlab environment running on a PC with 3.40GHz CPU with 4 cores and 8GB RAM. In order to select which of the three optimization methods will be used in O-ELM, the three approaches GA (GO-ELM), DE (DEO-ELM), and SA (SAO-ELM) were tested in the benchmark data sets of Table 1. There is a Matlab toolbox available online for performing the O-ELM framework, including the three tested optimization methods [34]. The performance of the methods is evaluated using mean and standard deviation of RMSE between the estimated and desired outputs in 20 trials. As in the O-ELM learning methods the structure of the SLFN is optimized, it was considered that the initial number of neurons in the hidden-layer is 30. 30% of the training data set was randomly picked and was used as the validation data set. In GA and DE, the population size was 80 individuals and the maximum number of iterations was 100. In GA, the mutation probability was 10%. In the SA, the maximum number of iterations was 500, the initial temperature was 100 and decreased with γ = 0.95. As SA is not population-based, a maximum number of iterations five times higher than in GA and DE was considered. In the DE it was considered that F is a random number uniformly distributed in the interval [0.5,2]. Table 2 presents the average of the 20 trials of the
Testing RMSE Mean Std 0.3503 0.0565 0.3530 0.0591 0.3609 0.0648 0.2608 0.049 0.2652 0.0522 0.2699 0.0508 0.5782 0.0229 0.6317 0.0183 0.6326 0.0224 0.2320 0.0428 0.2701 0.0248 0.2757 0.0273 0.1896 0.0231 0.2309 0.0662 0.1902 0.0308 0.1733 0.0281 0.1730 0.0058 0.1624 0.0067 0.2738 0.0230 0.2885 0.0403 0.3104 0.0485 0.0912 0.0007 0.0918 0.0006 0.0946 0.0015 0.1442 0.0333 0.293 0.0044 0.3074 0.0016 0.0774 0.0031 0.0831 0.0087 0.0872 0.0052 0.4901 0.0286 0.5147 0.0122 0.5888 0.0234 0.2051 0.0017 0.2046 0.0005 0.2117 0.0047 0.0848 0.0118 0.0926 0.0155 0.1377 0.0230 0.6767 0.3192 0.7359 0.2155 0.9666 0.3019 0.2679 0.0073 0.2729 0.0075 0.2767 0.0079 0.2372 0.0535 0.2238 0.0359 0.2376 0.0447
Training Hidden Time (s) Neurons 4.5463 21.15 8.3319 21.15 2.4084 18.45 4.3751 21.75 7.3566 21.75 1.1807 19.35 4.0442 18.25 7.4927 17.30 3.0648 18.50 3.6212 21.75 6.7205 21.75 0.9686 19.30 3.7553 18.70 6.9191 18.70 1.7578 19.60 3.7444 17.20 7.2471 17.20 1.1167 18.55 5.4051 22.05 9.6447 22.05 1.7068 19.60 27.4410 23.55 45.6752 23.55 22.4723 20.50 17.9558 22.95 28.0148 22.95 11.5952 21.10 30.3114 24.05 48.2631 23.65 12.9927 19.90 22.0324 25.25 35.9068 25.25 20.6218 20.30 16.2897 23.40 29.0433 23.40 12.0247 20.15 17.9445 23.75 25.3110 23.70 7.7895 21.25 9.5224 23.85 9.5277 23.85 1.7721 20.55 9.6412 22.10 17.2391 22.10 21.3763 21.15 3.8585 20.60 7.2562 20.60 2.6047 19.9
three optimization methods in the optimized extreme learning machine in all data sets and the mean number of neurons in the hidden-layer obtained after the optimization. For all data sets, the best RMSE is shown in 7
Continued from previous column Testing RMSE Training Hidden Data set Method Mean Std Time (s) Neurons ELM 0.3205 0.0031 0.0360 28.00 IGA-SLFN 0.3702 0.0354 8.6215 23.00 SaE-ELM 0.3125 0.0013 237.2725 30.00 LM-SLFN 0.0769 0.0057 45.0315 23.00 Elevators GO-ELM 0.0774 0.0031 30.3114 24.05 ELM 0.0882 0.0040 0.0570 26.00 IGA-SLFN 0.4000 0.1951 11.2353 22.00 SaE-ELM 0.0776 0.0015 369.6585 27.00 LM-SLFN 0.0655 0.0028 63.2730 15.00 Pole GO-ELM 0.4901 0.0286 22.0324 25.25 TelecommuELM 0.6067 0.0133 0.0410 29.00 nication IGA-SLFN 0.9116 0.1000 5.8895 22.00 SaE-ELM 0.5209 0.0111 264.5200 30.00 LM-SLFN 0.1865 0.0267 49.1555 26.00 Bank GO-ELM 0.2051 0.0017 16.2897 23.40 ELM 0.6067 0.0133 0.0410 29.00 IGA-SLFN 0.3171 0.0269 5.5643 15.00 SaE-ELM 0.5209 0.0111 264.5200 30.00 LM-SLFN 0.2060 0.0037 14.3920 16.00 Computer GO-ELM 0.0848 0.0118 17.9445 23.75 Activity ELM 0.1765 0.0151 0.0340 28.00 IGA-SLFN 0.5508 0.3076 4.7953 24.00 SaE-ELM 0.1028 0.0102 209.9595 29.00 LM-SLFN 0.0545 0.0030 9.9655 15.00 Stock GO-ELM 0.6767 0.3192 9.5224 23.85 ELM 0.7122 0.2705 0.0075 19.00 IGA-SLFN 0.8848 0.5259 2.2219 16.00 SaE-ELM 0.6860 0.3295 63.7135 15.00 LM-SLFN 0.8393 0.4125 1.3660 21.00 Communities GO-ELM 0.2679 0.0073 9.6412 22.10 and Crime ELM 0.2932 0.0081 0.0110 28.00 IGA-SLFN 0.4338 0.0528 7.4356 28.00 SaE-ELM 0.2716 0.0055 92.2715 26.00 LM-SLFN 0.3454 0.0274 22.0185 15.00 Pyrimidines GO-ELM 0.2372 0.0535 3.8585 20.60 ELM 0.2642 0.0680 0.0005 11.00 IGA-SLFN 0.3411 0.1333 1.6119 19.00 SaE-ELM 0.2399 0.0464 30.6100 19.00 LM-SLFN 0.3208 0.0818 0.3740 16.00
bold face. From the analysis of the results it can be verified that the number of neurons in the hidden layer obtained in the three approaches is similar. Concerning training time, in general SAO-ELM is the fastest method and DEO-ELM is the slowest. With respect to fitting performance it can be verified that GO-ELM shows the lowest mean of RMSE in the testing data set in 13 of the 16 benchmark data sets. Therefore the GA was selected as the optimization method to be used in the O-ELM. Table 3: Results of the application of the five methods in the benchmark data sets using the RMSE as the performance measure. Testing RMSE Training Hidden Mean Std Time (s) Neurons Boston GO-ELM 0.3503 0.0565 4.5463 21.15 Housing ELM 0.4652 0.1901 0.0020 20.00 IGA-SLFN 0.5208 0.1278 1.6540 28.00 SaE-ELM 0.4594 0.1628 8.0205 15.00 LM-SLFN 0.5182 0.1677 1.0935 15.00 Automobile GO-ELM 0.2608 0.0490 4.3751 21.75 MPG ELM 0.2610 0.0443 0.0025 19.00 IGA-SLFN 0.4874 0.1582 1.3098 23.00 SaE-ELM 0.2682 0.0491 7.5290 15.00 LM-SLFN 0.4030 0.1109 0.7050 17.00 Cancer GO-ELM 0.5782 0.0229 4.0442 18.25 ELM 0.5864 0.0240 0.0015 11.00 IGA-SLFN 0.5706 0.1478 2.1523 22.00 SaE-ELM 0.6169 0.0287 7.6500 15.00 LM-SLFN 0.8288 0.2090 1.0375 15.00 Servo GO-ELM 0.2320 0.0428 3.6212 21.75 ELM 0.2401 0.0263 0.0030 24.00 IGA-SLFN 0.5316 0.1419 1.0425 15.00 SaE-ELM 0.2301 0.0261 6.6805 15.00 LM-SLFN 0.2713 0.0566 0.5555 15.00 Price GO-ELM 0.1896 0.0231 3.7553 18.70 ELM 0.2092 0.0431