Stochastic Configuration Networks: Fundamentals and Algorithms Dianhui Wang*, Senior Member, IEEE, Ming Li, Student Member, IEEE
Abstract—This paper contributes to the development of randomized methods for neural networks. The proposed learner model is generated incrementally by stochastic configuration (SC) algorithms, termed Stochastic Configuration Networks (SCNs). In contrast to existing randomized learning algorithms for single layer feed-forward networks (SLFNs), we randomly assign the input weights and biases of the hidden nodes in the light of a supervisory mechanism, and the output weights are analytically evaluated in either a constructive or a selective manner. As fundamentals of SCN-based data modelling techniques, we establish some theoretical results on the universal approximation property. Three versions of SC algorithms are presented for data regression and classification problems in this work. Simulation results concerning both data regression and classification indicate some remarkable merits of our proposed SCNs in terms of less human intervention on the network size setting, the scope adaptation of random parameters, fast learning and sound generalization.

Index Terms—Stochastic configuration networks, randomized algorithms, incremental learning, universal approximation property, data analytics.
I. INTRODUCTION

Over the past decades, neural networks have been successfully applied to data modelling due to their universal approximation power for nonlinear maps [1]–[4] and their capability to learn from a collection of training samples [5], [6]. In practice, however, it is quite challenging to determine an appropriate architecture of a neural network so that the resulting learner model achieves sound performance in both learning and generalization. To resolve this problem, one can turn to constructive approaches for building neural networks: start with a small-sized network, then incrementally generate hidden nodes and output weights until a pre-defined termination criterion is met [7], [8]. It is well known that iteratively searching for an appropriate set of weights and biases is time-consuming and computationally intensive, which makes such methods particularly difficult to apply to large-scale data analytics.

D. Wang and M. Li are with the Department of Computer Science and Information Technology, La Trobe University, Melbourne, Victoria 3086, Australia. E-mail: [email protected]; [email protected]. * Corresponding author.

Randomized approaches for large-scale computing are highly desirable due to their effectiveness and efficiency [9]. In machine learning, randomized algorithms have demonstrated great potential in developing fast learner models and learning algorithms with much less computational cost [10]–[12]. Readers may refer to a survey paper for more detail [11]. From an algorithmic perspective, randomized learning techniques for
neural networks received attention in the late 1980s [13] and were further developed in the early 1990s [14], [15]. A common and basic idea behind these randomized learning algorithms is a two-step training paradigm: randomly assign the input weights and biases of the network, then evaluate the output weights by solving a linear equation system using the well-known least squares method or its regularized versions. In [16], Igelnik and Pao proved that a random vector functional-link (RVFL) network with random parameters chosen from a uniform distribution defined over a range can be a universal approximator with probability one for continuous functions. In [17], Husmeier revisited this significant result and showed that the universal approximation property of RVFL networks also holds for a symmetric interval setting of the random parameter scope, provided that the function to be approximated meets the Lipschitz condition. In [18], Tyukin and Prokhorov empirically investigated the feasibility of random basis approximators (equivalent to RVFL networks without direct links) for data modelling and showed that a supervisory mechanism is necessary to make RVFL networks applicable. Their experiments clearly indicate that RVFL networks fail to approximate a target function with very high probability if the setting of the random parameters is improper. This phenomenon was further studied with mathematical justifications in [19]. There are two key parameters involved in the design of RVFL networks: the number of hidden nodes and the scope of the random parameters. As one of the special features of this type of learner model, the first parameter, associated with the modelling accuracy, must be set sufficiently large. The second parameter is related to the approximation capability of the class of random basis functions. Obviously, these settings play a very important role in successfully building randomized learner models for real-world applications.
Our recent work reported in [20] reveals the infeasibility of RVFL networks if they are built incrementally with random input weights and biases assigned in a fixed scope and the convergence rate of the residual error satisfies certain conditions. Thus, further research on supervisory mechanisms with adaptive scope settings in the random assignment of the input weights and biases, ensuring the universal approximation capability of incremental RVFL (IRVFL) networks, is necessary and important. For the development of randomized learning techniques for neural networks, the prime and original contribution of this paper lies in the way the random parameters are assigned with inequality constraints, and in the adaptive selection of the scope of the random parameters, ensuring the universal approximation property of the built randomized learner models. This work is the first to touch the basis of the implementation of the
random basis function approximation theory. It should be pointed out that our proposed SCNs cannot be regarded as a specific implementation of RVFL networks because of some essential distinctions in assigning these random weights and biases. Three algorithmic implementations of SCNs, namely Algorithm SC-I, SC-II and SC-III, are presented with the same supervisory mechanism for configuring the random parameters, but different methods for computing the output weights. Concretely, SC-I employs a constructive scheme to evaluate the output weights only for the newly added hidden node and keeps all of the previously obtained output weights unchanged; SC-II recalculates a part of the current output weights by solving a local least squares problem with a user specified shifting window size; and SC-III finds the output weights all together through solving a global least squares problem with the current learner model. In this paper, we employ a function approximation, three benchmark regression tasks and four handwritten digit recognition datasets to evaluate the learners’ performance. Experimental results demonstrate remarkable improvements on modelling accuracy and recognition rate, compared to existing methods such as Modified Quickprop (MQ) [8] and IRVFL [20]. The remainder of this paper is organized as follows: Section II briefly reviews constructive neural networks, and recalls the infeasibility of IRVFL networks. Section III details our proposed stochastic configuration networks with both theoretical analysis and algorithmic description. Section IV reports our simulation results, and Section V concludes this paper with some remarks on further work. The following notation is used throughout this paper. Let Γ := {g1 , g2 , g3 ...} be a set of real-valued functions, span(Γ) denote a function space spanned by Γ; L2 (D) denote the space of all Lebesgue measurable functions f = [f1 , f2 , . . . 
, f_m] : R^d → R^m defined on D ⊂ R^d, with the L² norm defined as

‖f‖ := ( Σ_{q=1}^{m} ∫_D |f_q(x)|² dx )^{1/2} < ∞.    (1)

The inner product of θ = [θ_1, θ_2, ..., θ_m] : R^d → R^m and f is defined as

⟨f, θ⟩ := Σ_{q=1}^{m} ⟨f_q, θ_q⟩ = Σ_{q=1}^{m} ∫_D f_q(x) θ_q(x) dx.    (2)
In the special case that m = 1, for a real-valued function ψ : R^d → R defined on D ⊂ R^d, its L² norm becomes ‖ψ‖ := (∫_D |ψ(x)|² dx)^{1/2}, while the inner product of ψ_1 and ψ_2 becomes ⟨ψ_1, ψ_2⟩ = ∫_D ψ_1(x) ψ_2(x) dx.

II. RELATED WORK

Instead of training a learner model with a fixed architecture, a constructive neural network starts with a small-sized network and then adds hidden nodes incrementally until an acceptable tolerance is achieved. This approach does not require any prior knowledge about the network complexity appropriate for a given task. This section briefly reviews some closely related work on constructive neural networks. Some comments on these methods are also given.
Given a target function f : R^d → R, suppose that we have already built a SLFN with L − 1 hidden nodes, i.e., f_{L−1}(x) = Σ_{j=1}^{L−1} β_j g_j(w_j^T x + b_j) (L = 1, 2, ..., f_0 = 0), and the current residual error, denoted as e_{L−1} = f − f_{L−1}, does not reach an acceptable tolerance level. The construction process is concerned with how to incrementally add β_L and g_L (w_L and b_L), leading to f_L = f_{L−1} + β_L g_L, until the residual error falls into an expected tolerance ε.

A. Constructive Neural Networks: Deterministic Methods

In [7], Barron proposed a greedy approximation framework based on Jones' Lemma [21]. The main result can be stated in the following theorem.

Theorem 1 (Barron [7]). Given a Hilbert space G, suppose f ∈ conv(G) (the closure of the convex hull of the set G), and ‖g‖ ≤ b_g for each g ∈ G. Let b_f > c_f > sup_{g∈G} ‖g‖ + ‖f‖. For L = 1, 2, ..., the sequence of approximations {f_L} is deterministic and described as f_L = α_L f_{L−1} + (1 − α_L) g_L, where

α_L = b_f² / (b_f² + ‖f_{L−1} − f‖²),    (3)

and g_L is chosen such that the following condition holds:

⟨f_L − f, g_L − f⟩ ≤ 0.    (4)

For IRVFL networks built with random hidden nodes, our recent work [20] gives the following negative result.

Theorem 4 ([20]). Let span(Γ) be dense in L² space and ∀g ∈ Γ, 0 < ‖g‖ < b_g for some b_g ∈ R⁺. Suppose that g_L is randomly generated, that the output weights are taken as (7), and that the residual error sequence ‖e_L‖ meets conditions (8) and (9). Then, the constructive neural network with random weights, f_L, has no universal approximation capability, that is,

lim_{L→+∞} ‖f − f_L‖ ≥ √ε ‖f‖.    (10)
Theorem 4 reveals that IRVFL networks may not share the universal approximation property if the output weights are taken as (7) and the residual error sequence meets conditions (8) and (9). The consequence stated in Theorem 4 still holds if the output weights are evaluated using the least squares method. We state this interesting result in Theorem 5.

Theorem 5. Let span(Γ) be dense in L² space and ∀g ∈ Γ, 0 < ‖g‖ < b_g for some b_g ∈ R⁺. Suppose that g_L is randomly generated and the output weights are calculated by solving the global least squares problem, i.e., [β*_1, β*_2, ..., β*_L] = arg min_β ‖f − Σ_{j=1}^{L} β_j g_j‖. For sufficiently large L, if (8) and (9) hold for the residual error sequence ‖e*_L‖ (corresponding
to ‖e_L‖ in Theorem 4), the constructive neural network with random hidden nodes has no universal approximation capability, that is,

lim_{L→+∞} ‖f − Σ_{j=1}^{L} β*_j g_j‖ ≥ √ε ‖f‖.    (11)

Proof. Simple computations can verify that the sequence ‖e*_L‖² is monotonically decreasing and converges. Indeed, letting β̃_L = ⟨e*_{L−1}, g_L⟩/‖g_L‖², we have

‖e*_L‖² ≤ ⟨e*_{L−1} − β̃_L g_L, e*_{L−1} − β̃_L g_L⟩ = ‖e*_{L−1}‖² − ⟨e*_{L−1}, g_L⟩²/‖g_L‖² ≤ ‖e*_{L−1}‖².    (12)

Then, (11) can be easily obtained by following the proof of Theorem 4.

Clearly, the universal approximation property is conditional for IRVFL networks, regardless of the method used to evaluate the output weights. This happens for multiple reasons (e.g., the scope setting and/or an improper way of assigning the random parameters) that are hard to characterize mathematically. In this paper, we propose a solution for constructing randomized learner models under a supervisory mechanism to ensure the universal approximation property.

III. STOCHASTIC CONFIGURATION NETWORKS

The universal approximation property is fundamental to a learner model for data modelling. Logically, one cannot expect to build a neural network with good generalization but poor learning performance. Therefore, it is essential for SCNs to share the universal approximation property. This section details our proposed SCNs, including proofs of the universal approximation property and algorithm descriptions. Some remarks on algorithmic implementations are also given.

A. Universal Approximation Property

Given a target function f : R^d → R^m, suppose that a SCN with L − 1 hidden nodes has already been constructed, that is, f_{L−1}(x) = Σ_{j=1}^{L−1} β_j g_j(w_j^T x + b_j) (L = 1, 2, ..., f_0 = 0), where β_j = [β_{j,1}, ..., β_{j,m}]^T. Denote the current residual error by e_{L−1} = f − f_{L−1} = [e_{L−1,1}, ..., e_{L−1,m}]. If ‖e_{L−1}‖ does not reach a pre-defined tolerance level, we need to generate a new random basis function g_L (w_L and b_L) and evaluate the output weights β_L so that the resulting model f_L = f_{L−1} + β_L g_L has an improved residual error. In this paper, we propose a method to randomly assign the input weights and biases with a supervisory mechanism (an inequality constraint), and provide three ways to evaluate the output weights of SCNs with L hidden nodes (i.e., after adding the new node). Mathematically, we can prove that the resulting randomized learner models, incrementally built based on our stochastic configuration idea, are universal approximators.

Theorem 6. Suppose that span(Γ) is dense in L² space and ∀g ∈ Γ, 0 < ‖g‖ < b_g for some b_g ∈ R⁺. Given 0 < r < 1 and a nonnegative real number sequence {μ_L} with lim_{L→+∞} μ_L = 0 and μ_L ≤ (1 − r), for L = 1, 2, ..., denote

δ_L = Σ_{q=1}^{m} δ_{L,q},   δ_{L,q} = (1 − r − μ_L) ‖e_{L−1,q}‖².    (13)

If the random basis function g_L is generated to satisfy the following inequalities:

⟨e_{L−1,q}, g_L⟩² ≥ b_g² δ_{L,q},   q = 1, 2, ..., m,    (14)

and the output weights are constructively evaluated by

β_{L,q} = ⟨e_{L−1,q}, g_L⟩ / ‖g_L‖²,   q = 1, 2, ..., m,    (15)

then we have lim_{L→+∞} ‖f − f_L‖ = 0.

Proof. According to (15), it is easy to verify that {‖e_L‖²} is monotonically decreasing; thus, the sequence ‖e_L‖ is convergent as L → +∞. From (13), (14) and (15), we have

‖e_L‖² − (r + μ_L)‖e_{L−1}‖²
= Σ_{q=1}^{m} ⟨e_{L−1,q} − β_{L,q} g_L, e_{L−1,q} − β_{L,q} g_L⟩ − (r + μ_L) Σ_{q=1}^{m} ⟨e_{L−1,q}, e_{L−1,q}⟩
= (1 − r − μ_L)‖e_{L−1}‖² − Σ_{q=1}^{m} ⟨e_{L−1,q}, g_L⟩² / ‖g_L‖²
= δ_L − Σ_{q=1}^{m} ⟨e_{L−1,q}, g_L⟩² / ‖g_L‖²
≤ δ_L − Σ_{q=1}^{m} ⟨e_{L−1,q}, g_L⟩² / b_g² ≤ 0.    (16)

Therefore, the following inequality holds:

‖e_L‖² ≤ r‖e_{L−1}‖² + γ_L,   (γ_L = μ_L ‖e_{L−1}‖² ≥ 0).    (17)

Note that lim_{L→+∞} γ_L = 0; by using (17), we can easily show that lim_{L→+∞} ‖e_L‖² = 0, which implies lim_{L→+∞} ‖e_L‖ = 0. This completes the proof of Theorem 6.

Theorem 6 provides a constructive scheme, i.e., (14) and (15), that leads to a universal approximator. Unlike the strategy of maximizing some objective function in [8], the supervisory mechanism (14), which aims at finding appropriate w_L and b_L for a new hidden node, weakens the demanding condition of achieving the maximum value of ⟨e_{L−1}, g_L⟩² or ⟨e_{L−1}, g_L⟩²/‖g_L‖² (for the case m = 1). In fact, the existence of w_L and b_L satisfying (14) can easily be deduced because Ψ(w, b) = Σ_{q=1}^{m} ⟨e_{L−1,q}, g_L⟩²/‖g_L‖² > 0 is a continuous function in the parameter space, and δ_L = (1 − r − μ_L)‖e_{L−1}‖² tends to zero if the value of r is selected to approach one. The proposed supervisory mechanism (14) not only makes it possible to randomly assign the hidden parameters, which performs efficiently in generating a new hidden node, but also forces the residual error to zero along the constructive process.
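To make the mechanism concrete, a single stochastic-configuration step of Theorem 6 can be sketched in numpy as follows. This is a minimal illustration rather than the paper's full algorithm: the residual, scope, and constants (r, μ_L, b_g) are assumed values, and the L² inner products are replaced by sums over training samples.

```python
import numpy as np

rng = np.random.default_rng(0)
N, d, m = 200, 1, 1
X = rng.uniform(0.0, 1.0, size=(N, d))
e_prev = np.sin(2 * np.pi * X)              # current residual e_{L-1}, shape (N, m)

r, mu_L, b_g = 0.95, 0.01, 1.0              # the sigmoid satisfies ||g|| < b_g = 1
sigmoid = lambda z: 1.0 / (1.0 + np.exp(-z))

# Draw (w_L, b_L) until the supervisory inequality (14) holds:
#   <e_{L-1,q}, g_L>^2 >= b_g^2 * (1 - r - mu_L) * ||e_{L-1,q}||^2
for _ in range(1000):
    w = rng.uniform(-1.0, 1.0, size=(d, 1))
    b = rng.uniform(-1.0, 1.0)
    g = sigmoid(X @ w + b)                  # g_L evaluated on the samples, (N, 1)
    ip = (e_prev.T @ g).ravel()             # <e_{L-1,q}, g_L>, one entry per q
    delta = (1.0 - r - mu_L) * np.sum(e_prev ** 2, axis=0)
    if np.all(ip ** 2 >= b_g ** 2 * delta):
        break                               # candidate node accepted

# Constructive output weight (15): beta_{L,q} = <e_{L-1,q}, g_L> / ||g_L||^2
beta = ip / np.sum(g * g)
e_new = e_prev - g @ beta.reshape(1, m)     # e_L = e_{L-1} - beta_L g_L
assert np.linalg.norm(e_new) < np.linalg.norm(e_prev)
```

Because (15) subtracts the projection of the residual onto g_L, the residual norm can never increase; the inequality (14) guarantees the decrease is large enough, which is what drives ‖e_L‖ → 0 in the proof.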
Remark 4. Our proposed supervisory mechanism indicates that the random assignment of w_L and b_L should be constrained. The configuration of the hidden parameters needs to be relevant to the given training samples, rather than relying entirely on the distribution and/or scoping parameter setting of the random weights and biases. To the best of our knowledge, our attempt at the supervisory mechanism (14) is the first in the design of randomized learner models, and it fills the gap between the scheme of solving a global nonlinear optimization problem and the strategy of freely assigning the hidden parameters without any constraint. It is worth mentioning that Theorem 6 is still valid if the learning parameter r is unfixed and set based on an increasing sequence approaching one (always less than one). This makes the SC algorithm more flexible for the implementation of randomly searching for w_L and b_L even when the residual error is small. It has been observed that finding appropriate w_L and b_L for the newly added node becomes more challenging as the residual error becomes smaller; in this case, we need to set the value of r extremely close to one.

It is obvious that β_L = [β_{L,1}, ..., β_{L,m}]^T in Theorem 6 is analytically evaluated by β_{L,q} = ⟨e_{L−1,q}, g_L⟩/‖g_L‖² and remains fixed in the subsequent steps. This determination scheme, however, may result in a very slow convergence rate of the constructive process. Thus, we consider a recalculation scheme for the output weights; that is, once g_j (j = 1, 2, ..., L) have been generated according to (14), β_1, β_2, ..., β_L can be evaluated by minimizing the global residual error. Theorem 7 gives a result on the universal approximation property if the least squares method is applied to update the output weights in this manner.

Let [β*_1, β*_2, ..., β*_L] = arg min_β ‖f − Σ_{j=1}^{L} β_j g_j‖ and e*_L = f − Σ_{j=1}^{L} β*_j g_j = [e*_{L,1}, ..., e*_{L,m}], and define the intermediate values β̃_{L,q} = ⟨e*_{L−1,q}, g_L⟩/‖g_L‖² for q = 1, ..., m and ẽ_L = e*_{L−1} − β̃_L g_L, where β̃_L = [β̃_{L,1}, ..., β̃_{L,m}]^T and e*_0 = f.

Theorem 7. Suppose that span(Γ) is dense in L² space and ∀g ∈ Γ, 0 < ‖g‖ < b_g for some b_g ∈ R⁺. Given 0 < r < 1 and a nonnegative real number sequence {μ_L} with lim_{L→+∞} μ_L = 0 and μ_L ≤ (1 − r), for L = 1, 2, ..., denote

δ*_L = Σ_{q=1}^{m} δ*_{L,q},   δ*_{L,q} = (1 − r − μ_L) ‖e*_{L−1,q}‖².    (18)

If the random basis function g_L is generated to satisfy the following inequalities:

⟨e*_{L−1,q}, g_L⟩² ≥ b_g² δ*_{L,q},   q = 1, 2, ..., m,    (19)

and the output weights are evaluated by

[β*_1, β*_2, ..., β*_L] = arg min_β ‖f − Σ_{j=1}^{L} β_j g_j‖,    (20)

then we have lim_{L→+∞} ‖f − f*_L‖ = 0.

Proof. It is easy to show that ‖e*_L‖² ≤ ‖ẽ_L‖² = ‖e*_{L−1} − β̃_L g_L‖² ≤ ‖e*_{L−1}‖² ≤ ‖ẽ_{L−1}‖² for L = 1, 2, ..., so {‖e*_L‖²} is monotonically decreasing and convergent. Hence, we have

‖e*_L‖² − (r + μ_L)‖e*_{L−1}‖²
≤ ‖ẽ_L‖² − (r + μ_L)‖e*_{L−1}‖²
= Σ_{q=1}^{m} ⟨e*_{L−1,q} − β̃_{L,q} g_L, e*_{L−1,q} − β̃_{L,q} g_L⟩ − (r + μ_L) Σ_{q=1}^{m} ⟨e*_{L−1,q}, e*_{L−1,q}⟩
= (1 − r − μ_L) Σ_{q=1}^{m} ⟨e*_{L−1,q}, e*_{L−1,q}⟩ − Σ_{q=1}^{m} 2⟨e*_{L−1,q}, β̃_{L,q} g_L⟩ + Σ_{q=1}^{m} ⟨β̃_{L,q} g_L, β̃_{L,q} g_L⟩
= (1 − r − μ_L)‖e*_{L−1}‖² − Σ_{q=1}^{m} ⟨e*_{L−1,q}, g_L⟩² / ‖g_L‖²
= δ*_L − Σ_{q=1}^{m} ⟨e*_{L−1,q}, g_L⟩² / ‖g_L‖²
≤ δ*_L − Σ_{q=1}^{m} ⟨e*_{L−1,q}, g_L⟩² / b_g² ≤ 0.    (21)

Using the same arguments as in the proof of Theorem 6, we obtain lim_{L→+∞} ‖e*_L‖ = 0, which completes the proof of Theorem 7.

The evaluation of the output weights in Theorem 7 is straightforward with the use of the Moore–Penrose generalized inverse [23]. Unfortunately, this method is infeasible for large-scale data analytics. To address this problem, we suggest a trade-off solution with a window-shifting concept in the global least squares method (i.e., we only optimize a part of the output weights after the number of hidden nodes exceeds the window size). This selective scheme for evaluating the output weights is meaningful and significant for dealing with large-scale data processing. Similar to the proof of Theorem 7, the resulting SCN shares the universal approximation property. Here, we state this result in Theorem 8 and omit the detailed proof.

Theorem 8. Suppose that span(Γ) is dense in L² space and ∀g ∈ Γ, 0 < ‖g‖ < b_g for some b_g ∈ R⁺. Given 0 < r < 1 and a nonnegative real number sequence {μ_L} with lim_{L→+∞} μ_L = 0 and μ_L ≤ (1 − r), for a given window size K and L = 1, 2, ..., denote

δ*_L = Σ_{q=1}^{m} δ*_{L,q},   δ*_{L,q} = (1 − r − μ_L) ‖e*_{L−1,q}‖².    (22)

If the random basis function g_L is generated to satisfy the following inequalities:

⟨e*_{L−1,q}, g_L⟩² ≥ b_g² δ*_{L,q},   q = 1, 2, ..., m,    (23)

and, for L ≤ K, the output weights are evaluated by

[β*_1, β*_2, ..., β*_L] = arg min_β ‖f − Σ_{j=1}^{L} β_j g_j‖;    (24)

otherwise (i.e., L > K), the output weights are selectively evaluated (i.e., keep β*_1, ..., β*_{L−K} unchanged and renew β_{L−K+1}, ..., β_L) by

[β*_{L−K+1}, β*_{L−K+2}, ..., β*_L] = arg min_{β_{L−K+1},...,β_L} ‖f − Σ_{j=1}^{L−K} β*_j g_j − Σ_{j=L−K+1}^{L} β_j g_j‖,    (25)

then we have lim_{L→+∞} ‖f − f*_L‖ = 0.
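The window-shifting update (25) reduces to an ordinary least squares problem on a residual target. A small numpy illustration follows, with arbitrary sizes and a random matrix standing in for the collected hidden outputs H_L:

```python
import numpy as np

rng = np.random.default_rng(1)
N, L, K, m = 100, 12, 5, 1
H = rng.standard_normal((N, L))      # hidden outputs [h_1, ..., h_L] on N samples
T = rng.standard_normal((N, m))      # training targets

# Global least squares solution, as in (24).
beta = np.linalg.lstsq(H, T, rcond=None)[0]

# Window update (25): freeze beta_1, ..., beta_{L-K}; re-solve only the last K
# output weights against what is left of the target.
beta_frozen = beta[: L - K]
residual_target = T - H[:, : L - K] @ beta_frozen
beta_window = np.linalg.lstsq(H[:, L - K:], residual_target, rcond=None)[0]
beta_new = np.vstack([beta_frozen, beta_window])

# Starting from the globally optimal beta, the windowed re-solve recovers it,
# since the normal equations of (25) are a subset of those of (24).
assert np.allclose(beta_new, beta, atol=1e-8)
```

In the actual algorithm the frozen weights come from earlier iterations rather than a fresh global solve, so the windowed solution is only suboptimal; the point of the sketch is that each update costs a least squares problem of size K instead of L.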
B. Algorithm Description

This subsection details the proposed SC algorithms, i.e., SC-I, SC-II and SC-III, which are associated with Theorems 6, 8 and 7, respectively. In general, the main components of our proposed SC algorithms can be summarized as follows:
• Configuration of Hidden Parameters: randomly assign the input weights and biases to meet the constraint (14) ((19) or (23)), then generate a new hidden node and add it to the current learner model.
• Evaluation of Output Weights: constructively or selectively determine the output weights of the current learner model.

Given a training dataset with inputs X = {x_1, x_2, ..., x_N}, x_i = [x_{i,1}, ..., x_{i,d}]^T ∈ R^d, and corresponding outputs T = {t_1, t_2, ..., t_N}, where t_i = [t_{i,1}, ..., t_{i,m}]^T ∈ R^m, i = 1, ..., N. Denote by e_{L−1}(X) = [e_{L−1,1}(X), e_{L−1,2}(X), ..., e_{L−1,m}(X)] ∈ R^{N×m} the residual error matrix before adding the L-th new hidden node, where e_{L−1,q}(X) = [e_{L−1,q}(x_1), ..., e_{L−1,q}(x_N)]^T ∈ R^N, q = 1, 2, ..., m. Let

h_L(X) = [g_L(w_L^T x_1 + b_L), ..., g_L(w_L^T x_N + b_L)]^T,    (26)

which stands for the activations of the L-th hidden node on the inputs x_i, i = 1, 2, ..., N. The (current) hidden layer output matrix can be expressed as H_L = [h_1, h_2, ..., h_L]. In practice, the target function is presented as a collection of input–output data pairs, so β_{L,q} = ⟨e_{L−1,q}, g_L⟩/‖g_L‖² (q = 1, ..., m) becomes

β_{L,q} = (e^T_{L−1,q}(X) · h_L(X)) / (h^T_L(X) · h_L(X)),   q = 1, 2, ..., m.    (27)

For the sake of brevity, we introduce a set of variables ξ_{L,q}, q = 1, 2, ..., m, and use them in the algorithm descriptions (pseudo code):

ξ_{L,q} = (e^T_{L−1,q}(X) · h_L(X))² / (h^T_L(X) · h_L(X)) − (1 − r − μ_L) e^T_{L−1,q}(X) e_{L−1,q}(X).    (28)

Recall that, for a given window size K in SC-II, as L ≤ K the suboptimal solution β* = [β*_1, β*_2, ..., β*_L]^T ∈ R^{L×m} can be computed using the standard least squares method, that is,

β* = arg min_β ‖H_L β − T‖²_F = H_L† T,    (29)

where H_L† is the Moore–Penrose generalized inverse [23] and ‖·‖_F represents the Frobenius norm. As L > K, a portion of the output weights can be evaluated by

β_window = arg min_β ‖H_K β − T‖²_F = H_K† T,    (30)

where β_window consists of the most recent β_{L−K+1}, ..., β_L, H_K is composed of the last K columns of H, i.e., H_K = [h_{L−K+1}, ..., h_L], and the remaining (previous) β_1, ..., β_{L−K} are unchanged. It is easy to see that SC-II and SC-III become consistent once K ≥ L_max.

Algorithm SC-I
Given inputs X = {x_1, ..., x_N}, x_i ∈ R^d and outputs T = {t_1, ..., t_N}, t_i ∈ R^m; set the maximum number of hidden nodes L_max, the expected error tolerance ε, and the maximum times of random configuration T_max; choose a set of positive scalars Υ = {λ_min : Δλ : λ_max}.
1. Initialize e_0 := [t_1, t_2, ..., t_N]^T, 0 < r < 1, Ω, W := [ ];
2. While L ≤ L_max AND ‖e_0‖_F > ε, Do
Phase 1: Hidden Parameters Configuration (Steps 3–17)
3. For λ ∈ Υ, Do
4. For k = 1, 2, ..., T_max, Do
5. Randomly assign w_L and b_L from [−λ, λ]^d and [−λ, λ];
6. Calculate h_L and ξ_{L,q} based on Eqs. (26) and (28); set μ_L = (1 − r)/(L + 1);
7. If min{ξ_{L,1}, ξ_{L,2}, ..., ξ_{L,m}} ≥ 0
8. Save w_L and b_L in W, and ξ_L = Σ_{q=1}^{m} ξ_{L,q} in Ω;
9. Else go back to Step 4
10. End If
11. End For (corresponds to Step 4)
12. If W is not empty
13. Find w*_L, b*_L that maximize ξ_L in Ω, and set H_L = [h*_1, h*_2, ..., h*_L];
14. Break (go to Step 18);
15. Else randomly take τ ∈ (0, 1 − r), renew r := r + τ, and return to Step 4;
16. End If
17. End For (corresponds to Step 3)
Phase 2: Output Weights Determination
18. Calculate β_{L,q} = (e^T_{L−1,q} · h*_L)/(h*^T_L · h*_L), q = 1, 2, ..., m;
19. β_L = [β_{L,1}, ..., β_{L,m}]^T;
20. e_L = e_{L−1} − β_L h*_L;
21. Renew e_0 := e_L; L := L + 1;
22. End While
23. Return β_1, ..., β_L, w* = [w*_1, ..., w*_L], b* = [b*_1, ..., b*_L].

Algorithm SC-II
Given the same items as in Algorithm SC-I; set the window size K < L_max.
1. Initialize e_0 := [t_1, t_2, ..., t_N]^T, 0 < r < 1, Ω, W := [ ];
2. While L ≤ L_max AND ‖e_0‖_F > ε, Do
3. Proceed with Phase 1 of Algorithm SC-I;
4. Obtain H_L = [h*_1, h*_2, ..., h*_L];
5. If L ≤ K
6. Calculate β* = [β*_1, β*_2, ..., β*_L]^T = H_L† T;
7. Else
8. Set H̃_K = H_L(:, 1 : L−K), H_K = H_L(:, L−K+1 : L);
9. Retrieve β_previous = [β*_1, β*_2, ..., β*_{L−K}]^T;
10. Calculate β_window = H_K† (T − H̃_K β_previous);
11. Let β* = [β_previous; β_window];
12. End If
13. Calculate e_L = H_L β* − T;
14. Renew e_0 := e_L; L := L + 1;
15. End While
16. Return β*_1, ..., β*_L, w* = [w*_1, ..., w*_L] and b* = [b*_1, ..., b*_L].
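The candidate screening of Phase 1 (Steps 3–17) can be sketched in numpy as below: each random (w, b) is scored with ξ_{L,q} from (28), infeasible candidates are discarded, and the feasible one with the largest ξ_L is kept. The data, scope λ and T_max here are illustrative assumptions, and the outer loop over Υ is omitted.

```python
import numpy as np

rng = np.random.default_rng(2)
N, d, m = 150, 2, 1
X = rng.uniform(0.0, 1.0, size=(N, d))
e_prev = np.sin(3 * X[:, :1]) + X[:, 1:2] - 0.5   # residual e_{L-1}(X), (N, m)

r, mu_L, lam, T_max = 0.9, 0.05, 1.0, 200
sigmoid = lambda z: 1.0 / (1.0 + np.exp(-z))

best_xi, best = -np.inf, None
for _ in range(T_max):
    w = rng.uniform(-lam, lam, size=(d, 1))       # Step 5: random candidate
    b = rng.uniform(-lam, lam)
    h = sigmoid(X @ w + b)                        # h_L(X) from Eq. (26)
    # xi_{L,q} from Eq. (28):
    xi = ((e_prev.T @ h).ravel() ** 2 / np.sum(h * h)
          - (1.0 - r - mu_L) * np.sum(e_prev ** 2, axis=0))
    if np.all(xi >= 0) and xi.sum() > best_xi:    # Steps 7-8 and 13
        best_xi, best = xi.sum(), (w, b, h)

assert best is not None                           # a feasible node was found
w_star, b_star, h_star = best
```

Requiring min_q ξ_{L,q} ≥ 0 is exactly the sample version of inequality (14); maximizing ξ_L among the feasible candidates is the compactness heuristic discussed in Remark 7.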
Algorithm SC-III
Given the same items as in Algorithm SC-I.
1. Initialize e_0 := [t_1, t_2, ..., t_N]^T, 0 < r < 1, Ω, W := [ ];
2. While L ≤ L_max AND ‖e_0‖_F > ε, Do
3. Proceed with Phase 1 of Algorithm SC-I;
4. Obtain H_L = [h*_1, h*_2, ..., h*_L];
5. Calculate β* = [β*_1, β*_2, ..., β*_L]^T := H_L† T;
6. Calculate e_L = H_L β* − T;
7. Renew e_0 := e_L; L := L + 1;
8. End While
9. Return β*_1, ..., β*_L, w* = [w*_1, ..., w*_L] and b* = [b*_1, ..., b*_L].
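Putting the pieces together, a compact sketch of Algorithm SC-III on a toy one-dimensional problem follows. It is a simplified illustration under assumed settings (a fixed scope λ instead of the loop over Υ, and no r-adjustment step), with each iteration screening T_max random candidates and then recomputing all output weights via the pseudoinverse.

```python
import numpy as np

rng = np.random.default_rng(3)
N = 300
X = rng.uniform(0.0, 1.0, size=(N, 1))
T = np.sin(2 * np.pi * X)                       # training targets, (N, 1)

sigmoid = lambda z: 1.0 / (1.0 + np.exp(-z))
L_max, T_max, r, lam, tol = 50, 100, 0.99, 50.0, 1e-2
H, e = np.empty((N, 0)), T.copy()

for L in range(1, L_max + 1):
    mu_L = (1.0 - r) / (L + 1)
    best_xi, best_h = -np.inf, None
    for _ in range(T_max):                      # Phase 1: configure (w_L, b_L)
        w = rng.uniform(-lam, lam, size=(1, 1))
        b = rng.uniform(-lam, lam)
        h = sigmoid(X @ w + b)
        ip = float(e[:, 0] @ h[:, 0])           # <e_{L-1}, g_L> on the samples
        xi = ip ** 2 / np.sum(h * h) - (1.0 - r - mu_L) * np.sum(e ** 2)
        if xi >= 0 and xi > best_xi:
            best_xi, best_h = xi, h
    if best_h is None:
        continue                                # no feasible candidate this round
    H = np.hstack([H, best_h])                  # add the new hidden node
    beta = np.linalg.pinv(H) @ T                # Phase 2: global least squares
    e = T - H @ beta
    if np.linalg.norm(e) / np.sqrt(N) < tol:    # RMSE tolerance reached
        break

print(H.shape[1], "nodes, training RMSE:", np.linalg.norm(e) / np.sqrt(N))
```

Replacing the global pseudoinverse solve with the projection update (15) would give SC-I, and restricting the solve to the last K columns would give SC-II.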
Remark 5. Our proposed SC algorithms exhibit more flexibility in dealing with over-fitting. In practice, the whole algorithmic procedure of the proposed SC algorithms can be deployed on two separate datasets, i.e., a training set and a validation set, leading to two distinct sequences of residual errors. Then, it is straightforward to find the most appropriate architecture of the learner model without over-fitting, by referring to the lowest point of the residual error curve over the validation set. There are two methods to recalculate the output weights after pruning the "redundant nodes": (i) retain the previously obtained output weights without any recalculation; and (ii) merge the training and validation sets and use them to re-evaluate the output weights. Compared to optimization-based learning algorithms, SC algorithms provide more flexibility and confidence for end-users to quickly rebuild a learner model when it over-fits, instead of reproducing a brand new learner model from scratch or keeping all the weights of each iteration for model retrieval.

Remark 6. In the proposed SCN framework, the scope control set Υ := {λ_min : Δλ : λ_max} plays an important role in adaptively determining the range setting of the random parameters w and b. As indicated in Theorem 6, although the randomness is beneficial for the fast configuration of the newly added hidden node, the search area (termed the support of the random parameters, SRP) is data dependent and should not be rigidly fixed. In fact, the λ ∈ Υ varies during the course of adding new hidden nodes, which will be demonstrated later in our simulations.

Remark 7. It is worth mentioning that a larger value of Σ_{q=1}^{m} ⟨e_{L−1,q}, g_L⟩² / ‖g_L‖² would probably lead to a faster decrease in the residual error. This is the reason why we conduct T_max times of random configuration for w_L and b_L and finally return the pair w* and b* with the largest ξ_L, as specified in Phase 1 of Algorithm SC-I. Obviously, such an algorithmic operation helps in building compact SCNs. As the constructive process proceeds, the difficulties involved in the random configuration of w and b become increasingly higher as the residual error becomes smaller. To overcome this, we take a time-varying value of r so that Σ_{q=1}^{m} (1 − r − μ_L) e^T_{L−1,q}(X) e_{L−1,q}(X) approaches zero, which consequently creates more opportunities to make ξ_L ≥ 0.

IV. PERFORMANCE EVALUATION

A. Data Regression

This section reports some simulation results over four regression problems, including a function approximation example and
three real-world modelling tasks. Performance evaluation covers learning, generalization and efficiency (computing time and the eventual number of hidden nodes required to achieve an expected training error tolerance). Comparisons of our SC algorithms against Modified Quickprop (MQ) [8] and IRVFL [20] are carried out. In this study, the sigmoidal activation function g(x) = 1/(1 + exp(−x)) is used. Two metrics are used in our comparative studies, i.e., modelling accuracy and efficiency. The first is the widely used Root Mean Squares Error (RMSE). The second is the time spent on building the learner model to achieve the expected error tolerance.

1) Data Sets: Four datasets are employed in our simulation studies, namely a function approximation example, and three real-world regression datasets, stock, concrete and compactiv, downloaded from the KEEL (Knowledge Extraction based on Evolutionary Learning) dataset repository¹.

• DB 1 was generated by a real-valued function defined on [0, 1] [18], [20]:

f(x) = 0.2 e^{−(10x−4)²} + 0.5 e^{−(80x−40)²} + 0.3 e^{−(80x−20)²}.

The training dataset has 1000 points randomly generated from the uniform distribution on [0, 1]. The test set, of size 300, is generated from a regularly spaced grid on [0, 1].
• Data provided in stock are daily stock prices from January 1988 to October 1991 for ten aerospace companies. The task is to approximate the price of the 10th company using the prices of the rest. The whole dataset (DB 2) contains 950 observations with nine input variables and one output variable. We randomly selected 75% of the samples as the training set, with the test set consisting of the remaining 25%.
• The third regression task (concrete) is to find the highly nonlinear functional relationship between concrete compressive strength and the ingredients (input features), namely cement, blast furnace slag, fly ash, water, superplasticizer, coarse aggregate, fine aggregate, and age. The whole dataset (DB 3) contains 1020 instances. Both the training and test sets are formed in the same manner as for DB 2.
• The fourth dataset (compactiv) comes from a real-world application: a computer activity dataset describing the portion of time that CPUs run in user mode, based on 8192 computer system activities collected from a Sun SPARCstation 20/712 with 2 CPUs and 128 MB of memory, running in a multi-user university department. Each system activity is described by 21 system measures (input features). Both the training and test sets are formed in the same way as for DB 2 and DB 3.
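For reference, the DB 1 setup described above can be reproduced with a few lines of numpy; only the random seed is an arbitrary choice:

```python
import numpy as np

def f(x):
    # DB 1 target function on [0, 1]; the three Gaussian bumps peak near
    # x = 0.4, x = 0.5 and x = 0.25 with amplitudes 0.2, 0.5 and 0.3.
    return (0.2 * np.exp(-(10 * x - 4) ** 2)
            + 0.5 * np.exp(-(80 * x - 40) ** 2)
            + 0.3 * np.exp(-(80 * x - 20) ** 2))

rng = np.random.default_rng(42)
x_train = rng.uniform(0.0, 1.0, size=1000)   # 1000 uniformly random training points
x_test = np.linspace(0.0, 1.0, 300)          # 300 regularly spaced test points
y_train, y_test = f(x_train), f(x_test)

def rmse(y, y_hat):
    # Root Mean Squares Error, the accuracy metric used in the comparisons.
    return np.sqrt(np.mean((y - y_hat) ** 2))
```

The two narrow bumps (width controlled by the factor 80) are what make DB 1 hard for unsupervised random assignment with a small, fixed scope, as the IRVFL results below illustrate.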
Both the input and output attributes are normalized into [0, 1], and all results reported in this paper are averages over 100 independent trials. The maximum times of random configuration T_max is set as 200. For the other parameters, such as the maximum number of hidden nodes L_max, the expected error tolerance ε and the index r, we specify the corresponding settings later.

¹ KEEL: http://www.keel.es/

2) Results: Table I shows the training and test results of MQ, IRVFL, SC-I, SC-II and SC-III on DB 1–4, in which average values and standard deviations are reported on the basis of 100 independent trials. In particular, we set the learning rate in MQ as 0.05 and its maximum iterative number as 200 for all the regression tasks. In IRVFL, the random parameters (w and b) are taken from the uniform distribution over [−1, 1]. Clearly, from Table I, SC-I, SC-II and SC-III outperform the other methods in terms of both learning and generalization. For each setting of L, the training errors of MQ and IRVFL are larger than those of the SC algorithms, among which SC-III demonstrates the best performance on both the training and test datasets. For DB 2–4, both MQ and the SC algorithms lead to sound performance in comparison with the results obtained from IRVFL.

For efficiency comparisons, it can be observed from Table II that MQ, IRVFL and SC-I all fail to achieve the expected training error tolerance (ε = 0.05) within an acceptable time period for DB 1, whilst both IRVFL and SC-I could not achieve the given training error for DB 2. Here, we only report the results for DB 1 and DB 2, as the other two case studies share similar consequences. First, the missing values in Table II (marked as '-') for time cost and test performance are caused by the extremely slow learning rates in these scenarios; in practice, if the patience parameter mentioned in [8] were imposed, the learning phases of the MQ, IRVFL and SC-I algorithms would be terminated to avoid meaningless time cost. Consequently, it seems impossible for these methods to meet the expected error tolerance in a reasonable time slot. The proposed SC-II and SC-III algorithms, however, perform quite well and achieve the specified training error bound with a few hidden nodes.
On the other hand, compared to the MQ method, the SC-II and SC-III algorithms generate a smaller number of hidden nodes and demonstrate better generalization performance. The advantages of the proposed SC algorithms are clearly visible in Fig. 1 and Fig. 2, in which the curves are plotted based on the average values over 100 independent runs. The following observations can be made from these results.
• In Fig. 1, it can be observed that the error decreasing rates of our proposed SC algorithms are much higher than those of the other two methods. At the same time, their corresponding test errors continue to decrease when new hidden nodes are added. Furthermore, similar to the results reported in Table I, SC-II and SC-III perform much better than SC-I, which has a slower decreasing rate (although still faster than MQ and IRVFL). It should be noted that window size K = 15 is used in SC-II, so the error decreasing trends of SC-II and SC-III should be consistent with each other when L ≤ 15. This fact is clearly illustrated in Fig. 1.
• For the MQ and IRVFL algorithms on DB1, both the training and test errors stabilize at a certain level but remain unacceptable (the real training RMSE is larger than 0.1). This extremely slow decreasing rate in learning makes it impossible to build a learner model under a time constraint, even when L takes a larger value. In comparison, the improved decreasing rate of SC-I is evident compared to MQ after adding 150 hidden nodes.
• In Fig. 2, MQ, SC-II and SC-III converge faster than the other two algorithms, while IRVFL exhibits the slowest decreasing rate in both the training and test phases. Although MQ in this case outperforms SC-I, its error decreasing rate is slower than those of SC-II and SC-III for L ≤ 50. There is a negligible difference among the curves of SC-II, SC-III and MQ when L > 50. As a whole, the error curves of SC-III always stay at the bottom, which corresponds to the numerical results reported in Table II.
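SC-II's window-based recalculation (only the output weights of the most recent K hidden nodes are re-solved after a new node is added, with K = 15 used above) can be sketched as follows. The name `window_refit` is ours, and the least-squares formulation is our simplified reading of the scheme, not the authors' exact procedure.

```python
import numpy as np

def window_refit(H, y, beta, K):
    """Re-solve only the output weights of the last K hidden nodes.

    H: n x L hidden-layer output matrix; beta: length-L output weights
    (the last entry may be a placeholder for the newly added node);
    K: window size. Earlier output weights stay frozen."""
    L = H.shape[1]
    k = min(K, L)
    H_old, H_win = H[:, :L - k], H[:, L - k:]
    # residual left over by the frozen (older) part of the network
    r = y - H_old @ beta[:L - k]
    beta_win, *_ = np.linalg.lstsq(H_win, r, rcond=None)
    beta = beta.copy()
    beta[L - k:] = beta_win
    return beta
```

For L ≤ K the window covers all nodes and this refit coincides with a full SC-III-style recalculation, which is consistent with the observation above that the SC-II and SC-III error curves agree when L ≤ 15.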
Table III and Table IV report some results on the robustness of the modelling performance with respect to the window size K in SC-II. It should be noted that we fix the number of hidden nodes to be added in order to perform the robustness analysis. Furthermore, for the purpose of examining efficiency, we set the training error tolerance for each task (the same as in Table II) and recorded the real time cost and the number of hidden nodes required for achieving that tolerance. All of the results reported in Table III and Table IV are averaged over 100 independent runs.
For DB1, seven cases (K = 5, 10, 15, 20, 25, 30, 35) are considered and the number of hidden nodes is set as 35. It can be clearly seen that both the training and test performance improve as the value of K increases. Importantly, small K values may lead to inferior results; for example, the training and test RMSEs with K = 5, K = 10 and K = 15 are out of the tolerance level. On the other hand, the efficiency of SC-II becomes higher as the value of K increases. Apart from the scenarios of K = 5, K = 10 and K = 15, the differences among the remaining cases can be ignored: the number of hidden nodes is about 20 while the whole time cost is around 0.26s (see Table III). For DB2, five cases (K = 5, 10, 15, 20, 25) are examined while the number of hidden nodes is set as 25. Similarly, all of the results, except for K = 5 and K = 10, are comparable. In particular, for K = 15, 20, 25, the time cost and number of hidden nodes required to achieve the error tolerance (ε = 0.05) are around 0.2s and 16, respectively.
Fig. 3 depicts the modelling performance (on the test dataset) of IRVFL, SC-I, SC-II and SC-III on DB1. Firstly, the importance of the scale factor λ in the activation function, which directly determines the range of the random parameters, is examined under different settings. Secondly, the infeasibility of IRVFL is illustrated in comparison with the proposed SC algorithms, where the effectiveness and benefits of the supervisory mechanism of SCNs can be clearly observed. The maximum number of hidden nodes to be added is set as 100 for the four methods examined here. From Fig. 3(a), (b), (c), it is apparent that IRVFL networks perform poorly. In fact, we attempted to add 20,000 hidden nodes to see the performance of IRVFL networks with different values of λ. Unfortunately, the final result is very similar to that depicted in Fig. 3(c), which is aligned with our findings in [20]. In comparison, the test performance of SC-I shown in Fig. 3(d) is much better than that of IRVFL, which corresponds to the records in Table I and
TABLE I
PERFORMANCE COMPARISON OF MQ, IRVFL, SC-I, SC-II AND SC-III ON DB1, DB2, DB3 AND DB4. K = 15 WAS USED IN SC-II.

Data Sets | Algorithms | Training (L = 25) | Training (L = 50) | Test (L = 25) | Test (L = 50)
DB1 | MQ     | 0.1031±0.0001 | 0.1030±0.0001 | 0.1011±0.0003 | 0.1011±0.0003
DB1 | IRVFL  | 0.1630±0.0008 | 0.1626±0.0005 | 0.1622±0.0012 | 0.1617±0.0008
DB1 | SC-I   | 0.0927±0.0020 | 0.0887±0.0018 | 0.0912±0.0021 | 0.0870±0.0019
DB1 | SC-II  | 0.0435±0.0061 | 0.0366±0.0049 | 0.0411±0.0064 | 0.0337±0.0049
DB1 | SC-III | 0.0332±0.0065 | 0.0097±0.0036 | 0.0308±0.0060 | 0.0100±0.0033
DB2 | MQ     | 0.0624±0.0058 | 0.0410±0.0014 | 0.0611±0.0061 | 0.0407±0.0017
DB2 | IRVFL  | 0.2121±0.0260 | 0.1853±0.0248 | 0.2057±0.0268 | 0.1787±0.0237
DB2 | SC-I   | 0.0963±0.0032 | 0.0881±0.0026 | 0.0925±0.0031 | 0.0851±0.0024
DB2 | SC-II  | 0.0437±0.0014 | 0.0395±0.0008 | 0.0427±0.0017 | 0.0391±0.0010
DB2 | SC-III | 0.0409±0.0010 | 0.0327±0.0007 | 0.0403±0.0012 | 0.0347±0.0012
DB3 | MQ     | 0.1096±0.0042 | 0.0910±0.0014 | 0.0999±0.0048 | 0.0869±0.0021
DB3 | IRVFL  | 0.2045±0.0189 | 0.1929±0.0135 | 0.2109±0.0214 | 0.1983±0.0166
DB3 | SC-I   | 0.1401±0.0027 | 0.1358±0.0022 | 0.1315±0.0026 | 0.1258±0.0017
DB3 | SC-II  | 0.0994±0.0018 | 0.0943±0.0011 | 0.0925±0.0029 | 0.0874±0.0018
DB3 | SC-III | 0.0969±0.0013 | 0.0835±0.0012 | 0.0898±0.0020 | 0.0850±0.0025
DB4 | MQ     | 0.0840±0.0052 | 0.0600±0.0071 | 0.0848±0.0053 | 0.0624±0.0075
DB4 | IRVFL  | 0.2002±0.0391 | 0.1924±0.0283 | 0.1958±0.0386 | 0.1882±0.0281
DB4 | SC-I   | 0.1207±0.0036 | 0.1137±0.0038 | 0.1169±0.0038 | 0.1105±0.0040
DB4 | SC-II  | 0.0760±0.0034 | 0.0579±0.0029 | 0.0773±0.0039 | 0.0593±0.0036
DB4 | SC-III | 0.0678±0.0038 | 0.0394±0.0016 | 0.0697±0.0044 | 0.0418±0.0021
TABLE II
EFFICIENCY COMPARISON OF MQ, IRVFL, SC-I, SC-II AND SC-III ON DB1 AND DB2 (ε = 0.05). WINDOW SIZE K = 15 WAS USED IN SC-II. '-' INDICATES THAT THE EXPECTED TOLERANCE WAS NOT REACHED IN AN ACCEPTABLE TIME (SEE TEXT).

Data Sets | Algorithms | Time (s) | Nodes | Test
DB1 | MQ     | - | - | -
DB1 | IRVFL  | - | - | -
DB1 | SC-I   | - | - | -
DB1 | SC-II  | 0.4029±0.1690 | 26.6700±8.7514 | 0.0463±0.0021
DB1 | SC-III | 0.2737±0.0719 | 19.9000±3.6167 | 0.0435±0.0047
DB2 | MQ     | 1.3157±0.6429 | 34.1809±3.5680 | 0.0532±0.0175
DB2 | IRVFL  | - | - | -
DB2 | SC-I   | - | - | -
DB2 | SC-II  | 0.2193±0.0454 | 16.7200±2.1513 | 0.0488±0.0016
DB2 | SC-III | 0.2083±0.0317 | 16.3600±1.5987 | 0.0483±0.0020
Fig. 1. Performance of Modified QuickProp (MQ), IRVFL, SC-I, SC-II and SC-III with 150 additive nodes on DB 1: (a) Average training RMSE and (b) Average test RMSE
Fig. 2. Performance of Modified QuickProp (MQ), IRVFL, SC-I, SC-II and SC-III with 150 additive nodes on DB 2: (a) Average training RMSE and (b) Average test RMSE

TABLE III
PERFORMANCE OF SC-II WITH DIFFERENT WINDOW SIZES K ON DB1.

Window Size | Training (L = 35) Mean | STD | Test (L = 35) Mean | STD | Time (s) (ε = 0.05) | Nodes
K=5  | 0.0682 | 0.0049 | 0.0672 | 0.0049 | 1.36 | 99.99
K=10 | 0.0530 | 0.0061 | 0.0516 | 0.0064 | 0.72 | 50.85
K=15 | 0.0439 | 0.0062 | 0.0416 | 0.0068 | 0.34 | 25.07
K=20 | 0.0364 | 0.0069 | 0.0338 | 0.0064 | 0.28 | 21.21
K=25 | 0.0325 | 0.0073 | 0.0303 | 0.0067 | 0.26 | 20.29
K=30 | 0.0267 | 0.0056 | 0.0252 | 0.0049 | 0.25 | 19.62
K=35 | 0.0248 | 0.0069 | 0.0233 | 0.0062 | 0.25 | 19.75
TABLE IV
PERFORMANCE OF SC-II WITH DIFFERENT WINDOW SIZES K ON DB2.

Window Size | Training (L = 25) Mean | STD | Test (L = 25) Mean | STD | Time (s) (ε = 0.05) | Nodes
K=5  | 0.0613 | 0.0039 | 0.0601 | 0.0037 | 0.88 | 66.42
K=10 | 0.0492 | 0.0024 | 0.0493 | 0.0027 | 0.31 | 23.48
K=15 | 0.0443 | 0.0014 | 0.0436 | 0.0020 | 0.21 | 16.62
K=20 | 0.0420 | 0.0012 | 0.0412 | 0.0015 | 0.20 | 16.16
K=25 | 0.0411 | 0.0010 | 0.0405 | 0.0013 | 0.20 | 16.35

the error changing curves in Fig. 1, respectively. In addition, both SC-II and SC-III outperform SC-I, since the recalculation of βL contributes greatly to the construction of a universal approximator with a much more compact structure.
3) Discussions: This section briefly explains why the performance of our proposed SC algorithms is better than that of the MQ and IRVFL algorithms. Then, further remarks are provided to highlight some of the characteristics of SCNs. Although the universal approximation property can be ensured by maximizing an objective function (see Section II), the Modified Quickprop (MQ) algorithm, which aims at iteratively finding appropriate hidden parameters (w and b) for the new hidden node, may face obstacles in carrying out the optimization of that objective function. In essence, it is a gradient-ascent algorithm that incorporates second-order information, and it may be problematic when searching in a plateau over the error surface. That is to say, the trained parameters (w and b) at certain iterative steps
fall into a region where both the first- and second-order derivatives of the objective function are nearly zero, so that learning essentially comes to a halt and can be terminated by a patience parameter for the purpose of saving time. The overall speed of parameter updating might be quite slow, and this issue seems severe for regression tasks. For implementation, MQ is less flexible and seems very likely to fail in constructing a universal approximator, not to mention some inherent flaws, including the selection of the initial weights, the learning rate, as well as a reasonable stopping criterion. Furthermore, it incurs a heavy computational burden in dealing with large-scale problems.
The unique feature of the proposed SCNs is the use of randomness in the learner model design with a supervisory mechanism. Compared with the Modified Quickprop algorithm, SC-II and SC-III are computationally manageable and easy to implement. In addition, the output weights βL in IRVFL are analytically calculated by e_{L-1}(X)^T · h_L^* / (h_L^{*T} · h_L^*) and then remain fixed
Fig. 3. Approximation performance of IRVFL (with different settings of λ) and SC-I, SC-II, SC-III on DB 1; each panel plots the target output against the model output: (a) IRVFL, λ = 1; (b) IRVFL, λ = 100; (c) IRVFL, λ = 200; (d) SC-I; (e) SC-II, K = 15; (f) SC-III.
Fig. 4. Comparison between SC-I and IRVFL with various types of activation functions (sigmoid, tanh, sin, cos, Gaussian): (a) Training and (b) Test.
in subsequent phases. This constructive approach to computing the output weights, together with the freely random assignment of the hidden parameters, may result in a very slowly decreasing residual error sequence, and as a result the learner model may fail to approximate the target function. Although this evaluation method for the output weights is also applied in SC-I, the whole construction process works successfully in building a universal approximator. Finally, some of the features of our proposed SCNs are summarized as follows:
• The selection of r, which to some extent is directly associated with the error decreasing speed, is quite important. In our experiments, its setting is not fixed but based on an increasing sequence starting from 0.9 and approaching 1. This makes the random search for wL and bL feasible. As the constructive process proceeds, the current residual error becomes smaller, which makes the configuration task on wL and bL more challenging (i.e., more search attempts are needed as L becomes larger). From our experience, this difficulty can be effectively alleviated by setting r extremely close to 1.
• In SC algorithms, λ may keep varying during the learning course. This implies that the scope setting for w and b should not be fixed. In our simulations, λ is automatically selected from a given set {1, 5, 15, 30, 50, 100, 150, 200}, along with the setup of r. This strategy makes it possible for the built learner to possess multi-scale random basis functions, in contrast to RVFL networks (reported in [19]) or IRVFL, and can inherently improve the probability of finding appropriate settings of the input weights and biases.
• In the implementation, Tmax controls the pool size of the random basis function candidates. It is related to both opportunity and efficiency, so this parameter needs to be chosen carefully with a trade-off in mind. This operation aims at finding the most appropriate pair of w and b that returns the largest ξL (see Algorithm SC-I). Roughly speaking, this manipulation can be viewed as an alternative way to find a 'suboptimal' solution of the maximization problem discussed in [8].
• Based on the theoretical analysis given in Section IV, many types of activation functions can be used in SCNs, such as the Gaussian, sine, cosine and tanh functions. Fig. 4 shows both the training and test performance of SC-I and IRVFL with different choices of activation functions. It can be clearly seen that our SC-I algorithm is more efficient and effective than IRVFL, as the random assignment of w and b from [-1,1] is unworkable no matter what kind of activation function is employed in the hidden layer.
• As a compromise between SC-I and SC-III, SC-II provides more flexibility in dealing with large-scale data modelling tasks, by virtue of its distinctive window-based recalculation of the output weights. By optimizing a part of the output weights according to a given window size, SC-II not only outperforms SC-I (whose output weight evaluation is not based on any objective and receives no further adjustment throughout the construction process), but also has some advantages over SC-III in building universal approximators for large-scale data analytics. In practice, when the number of training samples is extremely large and L > K, the computational burden of calculating H_K^† in SC-II is far less than that of calculating H_L^† in SC-III. Furthermore, the robustness of SC-II with regard to the window size is favourable once K is assigned from a reasonable range, as shown in Table III and Table IV. It will be meaningful and interesting to explore more characteristics of SC-II in the future.
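Putting the items above together, a minimal single-output sketch of the stochastic configuration step (a scope set for λ, an increasing sequence for r, up to Tmax attempts, the inequality ξL > 0, and the constructive output weight) might look as follows. The function name and control flow are ours, and the inequality shown is a simplified form of the supervisory mechanism, not the exact condition (14) from the paper.

```python
import numpy as np

def configure_node(X, e, r_seq=(0.9, 0.99, 0.999, 0.9999),
                   scopes=(1, 5, 15, 30, 50, 100, 150, 200),
                   Tmax=200, rng=None):
    """Search for one admissible random node (w, b) given residual e.

    A candidate is accepted only if the (simplified) supervisory
    inequality xi = (e^T h)^2 / (h^T h) - (1 - r) * e^T e > 0 holds;
    its output weight is then the constructive beta = e^T h / (h^T h).
    If no candidate passes within Tmax draws, the scope lambda and the
    index r are relaxed, mirroring the adaptive settings above."""
    rng = np.random.default_rng(rng)
    n, d = X.shape
    for lam in scopes:
        for r in r_seq:
            for _ in range(Tmax):
                w = rng.uniform(-lam, lam, d)
                b = rng.uniform(-lam, lam)
                h = 1.0 / (1.0 + np.exp(-(X @ w + b)))   # sigmoid node
                xi = (e @ h) ** 2 / (h @ h) - (1 - r) * (e @ e)
                if xi > 0:
                    beta = (e @ h) / (h @ h)
                    return w, b, beta, h
    raise RuntimeError("no admissible node found")
```

With the constructive beta, an accepted node is guaranteed to strictly reduce the residual norm, which is the point of the supervisory test.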
B. Handwritten Digit Recognition In this part, we compare our proposed SCNs with RVFL networks on data classification problems. Four benchmark handwritten digit databases are employed in the comparison. Here, SCN-based classifiers are built by Algorithm SC-III with Tmax = 5, λ = {1, 5, 15, 30, 50, 100, 150, 200, 250}, and r = {1 − 10^{-j}}, j = 2, ..., 7. The scope for randomly assigning the hidden parameters (input weights and biases) in RVFL networks is set as [-1,1]. 1) Handwritten Digit Databases: • The Bangla, Devnagari, and Oriya handwritten digit databases are provided by the ISI (Indian Statistical Institute, Kolkata)2 as a benchmark resource for optical character recognition of Indian scripts. In detail, Bangla is the second-most popular language and script of the Indian subcontinent and the fifth-most popular language in the world. The Bangla handwritten database consists of 23392 samples (grayscale images of digits 0-9) written by 1106 persons, in which 19392 samples are used for training while the remaining 4000 images are for testing (400 per class). Most of the images were collected from 465 mail pieces and 268 job application forms in India. Devanagari, which is part of the Brahmic family of scripts of India, Nepal, Tibet, and South-East Asia, is the most popular language and script of India. The Devnagari digit database consists of 22556 samples (18794 for training and 3762 for testing) from 1049 persons. The majority of digit images in the database were retrieved from 368 mail pieces and 274 job application forms. Oriya developed from Kalinga, one of the many descendants of the Brahmi script of ancient India. The Oriya handwritten database consists of 5970 samples from 356 persons. Images in this database were formed from 105 mail pieces and 166 job application forms. The whole dataset has been split into a training set comprising 4970 samples and a test set comprising 1000 samples (100 per class). • The CVL database has received careful attention in some pattern recognition competitions3. The complete digit dataset consists of 10 classes (0-9) with 3,578 samples per class. 7,000 samples (700 digits per class) of 67 writers were selected as the training set, while an equal-sized set from 60 writers is used as the validation set. It should be noted that we used the 14000 digits of these 127 writers as the training set in our experiment. The test set consists of 2,178 digits per class, resulting in 21,780 evaluation samples from the remaining 176 writers. The four databases are summarized in Table V.
All the images of these four databases have been converted to grayscale and a normalized size (28 × 28) by applying preprocessing procedures similar to those used for the MNIST database4; that is, each image is scaled to 20 × 20 while preserving the aspect ratio, and then the centroid of the image is placed at the center of a standard 28 × 28 plane. Some sample images are shown in Fig. 5.

TABLE V
STATISTICS OF HANDWRITTEN DIGIT DATABASES

Name | Number of Images | Training | Test
Bangla    | 23392 | 19392 | 4000
Devnagari | 22556 | 18794 | 3762
Oriya     | 5970  | 4970  | 1000
CVL       | 35780 | 14000 | 21780
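The MNIST-style preprocessing described above (scale to 20 × 20 preserving the aspect ratio, then place the intensity centroid at the centre of a 28 × 28 plane) can be sketched roughly as follows. The function `preprocess_digit` is our name, nearest-neighbour scaling is used for simplicity, and details such as interpolation differ from the original pipeline.

```python
import numpy as np

def preprocess_digit(img):
    """Scale a grayscale digit so its longer side is 20 px (aspect ratio
    preserved, nearest-neighbour), then paste it onto a 28x28 canvas with
    its intensity centroid at the canvas centre (13.5, 13.5)."""
    img = np.asarray(img, dtype=float)
    h, w = img.shape
    s = 20.0 / max(h, w)
    nh, nw = max(1, int(round(h * s))), max(1, int(round(w * s)))
    rows = (np.arange(nh) / s).astype(int).clip(0, h - 1)
    cols = (np.arange(nw) / s).astype(int).clip(0, w - 1)
    small = img[np.ix_(rows, cols)]          # nearest-neighbour resize
    canvas = np.zeros((28, 28))
    total = small.sum()
    if total > 0:
        cy = (small * np.arange(nh)[:, None]).sum() / total
        cx = (small * np.arange(nw)[None, :]).sum() / total
    else:
        cy, cx = (nh - 1) / 2, (nw - 1) / 2
    # shift so the centroid lands on the canvas centre, clamped to bounds
    top = min(max(int(round(13.5 - cy)), 0), 28 - nh)
    left = min(max(int(round(13.5 - cx)), 0), 28 - nw)
    canvas[top:top + nh, left:left + nw] = small
    return canvas
```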
2 http://www.isical.ac.in/~ujjwal/download/database.html
3 http://www.caa.tuwien.ac.at/cvl/category/research/cvl-databases/
4 http://yann.lecun.com/exdb/mnist/

2) Results and Discussion: In this comparison, different numbers of hidden nodes, e.g., L = 500:500:4000 for
Fig. 6. Comparison of recognition rates between SCN and RVFL: (a) Bangla, training; (b) Devnagari, training; (c) Oriya, training; (d) CVL, training; (e) Bangla, test; (f) Devnagari, test; (g) Oriya, test; (h) CVL, test.
Bangla, Devnagari and CVL, while L = 200:200:1400 for Oriya, were used in the SCN-based and RVFL-based classifiers, and the experiment with each setting of L was conducted independently 10 times. For these four handwritten digit databases, both the training and test performance (with the mean and standard deviation of the recognition rate) are detailed in Fig. 6. It is clearly shown that our SCNs perform much better than the RVFL networks for each setting of the network architecture. It should be mentioned that we do not plot the associated results obtained when enlarging the value of L until over-fitting. Based on our observations after running many simulations with larger L, the SCN-based classifier always outperforms the RVFL-based one even when over-fitting occurs. Furthermore, RVFL-based classifiers with alternative settings for the scope of the random parameters show no significant improvement in performance, or even obtain much worse results than those with the scope [-1,1].

Fig. 5. Samples of handwritten digits from the Bangla, Devnagari, Oriya, and CVL databases.

V. CONCLUSION AND FURTHER WORK
The stochastic configuration idea proposed in this paper is original, innovative and significant for developing randomized learning techniques for neural networks. The supervisory mechanism reported in this work clearly reveals that the assignment of the random parameters of learner models should be data dependent to ensure the universal approximation property of the resulting randomized learner model. To the best of our knowledge, this is the first work in two decades to technically and substantially improve the basic understanding of randomized methods for training neural networks. It is believed that the proposed framework will form a solid basis for further development of randomized learning algorithms, which are becoming a pathway to big data analytics.
From an implementation perspective, ξL = Σ_{q=1}^{m} ξL,q > 0 can be used in the SC algorithms, and a batch manner of creating the random parameters (i.e., a collection of random basis functions generated by using the proposed supervisory mechanism) can greatly speed up the learning process. It should be mentioned that the region characterized by the constraints (14) in the random parameter space can be defined as a stochastic configuration support (SCS), which mainly contributes to the reduction of the approximation error. There are many interesting applications of the proposed SCNs in deep learning, such as stochastic configuration autoencoders (SCAs) for dimensionality reduction and representation learning [24]–[27]. Some immediate extensions of the proposed SCNs have been completed, including a deep version of SCNs, robust SCNs and an SCN ensemble [28]–[30]. Further work, such as the development of recurrent SCNs [10], Bayesian SCNs [31], and online and/or distributed SCNs
[32] are being expected. Beyond the studies mentioned above, the following research directions are anticipated: robustness analyses of the modelling performance with respect to some key parameters involved in the design of SCNs; a guideline for selecting the basis/activation function so that the modelling performance can be maximized; and explorations of real-world applications.

ACKNOWLEDGMENT
The authors would like to thank Professor Witold Pedrycz, who read the original manuscript and provided some insightful feedback during his visit to La Trobe University in 2017.

REFERENCES
[1] G. Cybenko, "Approximation by superpositions of a sigmoidal function," Mathematics of Control, Signals and Systems, vol. 2, no. 4, pp. 303–314, 1989.
[2] K. Hornik, M. Stinchcombe, and H. White, "Multilayer feedforward networks are universal approximators," Neural Networks, vol. 2, no. 5, pp. 359–366, 1989.
[3] J. Park and I. W. Sandberg, "Universal approximation using radial basis function networks," Neural Computation, vol. 3, no. 2, pp. 246–257, 1991.
[4] T. Chen and H. Chen, "Universal approximation to nonlinear operators by neural networks with arbitrary activation functions and its application to dynamical systems," IEEE Transactions on Neural Networks, vol. 6, no. 4, pp. 911–917, 1995.
[5] R. Hecht-Nielsen, "Theory of the backpropagation neural network," Neural Networks, vol. 1, no. 1, pp. 445–448, 1988.
[6] D. E. Rumelhart, G. E. Hinton, and R. J. Williams, "Learning internal representations by back-propagating errors," Nature, vol. 323, pp. 533–536, 1986.
[7] A. R. Barron, "Universal approximation bounds for superpositions of a sigmoidal function," IEEE Transactions on Information Theory, vol. 39, no. 3, pp. 930–945, 1993.
[8] T. Y. Kwok and D. Y. Yeung, "Objective functions for training new hidden units in constructive neural networks," IEEE Transactions on Neural Networks, vol. 8, no. 5, pp. 1131–1148, 1997.
[9] M. W. Mahoney, "Randomized algorithms for matrices and data," Foundations and Trends in Machine Learning, vol. 3, no. 2, pp. 123–224, 2011.
[10] M. Lukoševičius and H. Jaeger, "Reservoir computing approaches to recurrent neural network training," Computer Science Review, vol. 3, no. 3, pp. 127–149, 2009.
[11] S. Scardapane and D. Wang, "Randomness in neural networks: An overview," WIREs Data Mining and Knowledge Discovery, vol. e1200, doi: 10.1002/widm.1200, 2017.
[12] D. Wang, "Editorial: Randomized algorithms for training neural networks," Information Sciences, vol. 364, pp. 126–128, 2016.
[13] D. S. Broomhead and D. Lowe, "Multi-variable functional interpolation and adaptive networks," Complex Systems, vol. 2, pp. 321–355, 1988.
[14] Y. H. Pao and Y. Takefuji, "Functional-link net computing: theory, system architecture, and functionalities," Computer, vol. 25, no. 5, pp. 76–79, 1992.
[15] W. F. Schmidt, M. A. Kraaijveld, and R. P. Duin, "Feedforward neural networks with random weights," in Proceedings of the 11th International Conference on Pattern Recognition, 1992, pp. 1–4.
[16] B. Igelnik and Y. H. Pao, "Stochastic choice of basis functions in adaptive function approximation and the functional-link net," IEEE Transactions on Neural Networks, vol. 6, no. 6, pp. 1320–1329, 1995.
[17] D. Husmeier, Neural Networks for Conditional Probability Estimation: Forecasting Beyond Point Predictions. Springer Science and Business Media, 2012.
[18] I. Y. Tyukin and D. V. Prokhorov, "Feasibility of random basis function approximators for modeling and control," in Proceedings of IEEE International Conference on Control Applications and Intelligent Control, 2009, pp. 1391–1396.
[19] A. N. Gorban, I. Y. Tyukin, D. V. Prokhorov, and K. I. Sofeikov, "Approximation with random bases: Pro et contra," Information Sciences, vol. 364, pp. 129–145, 2016.
[20] M. Li and D. Wang, "Insights into randomized algorithms for neural networks: Practical issues and common pitfalls," Information Sciences, vol. 382–383, pp. 170–178, 2016.
[21] L. K. Jones, "A simple lemma on greedy approximation in Hilbert space and convergence rates for projection pursuit regression and neural network training," The Annals of Statistics, vol. 20, no. 1, pp. 608–613, 1992.
[22] T. Y. Kwok and D. Y. Yeung, "Constructive neural networks: Some practical considerations," in Proceedings of IEEE International Conference on Neural Networks, 1994, pp. 198–203.
[23] P. Lancaster and M. Tismenetsky, The Theory of Matrices: With Applications. Elsevier, 1985.
[24] G. E. Hinton, S. Osindero, and Y. W. Teh, "A fast learning algorithm for deep belief nets," Neural Computation, vol. 18, no. 7, pp. 1527–1554, 2006.
[25] G. E. Hinton and R. R. Salakhutdinov, "Reducing the dimensionality of data with neural networks," Science, vol. 313, no. 5786, pp. 504–507, 2006.
[26] Y. LeCun, Y. Bengio, and G. E. Hinton, "Deep learning," Nature, vol. 521, no. 7553, pp. 436–444, 2015.
[27] K. Zeng, J. Yu, R. Wang, C. Li, and D. Tao, "Coupled deep autoencoder for single image super-resolution," IEEE Transactions on Cybernetics, vol. 47, no. 1, pp. 27–37, 2017.
[28] D. Wang and M. Li, "Deep stochastic configuration networks with universal approximation property," arXiv:1702.05639, 2017.
[29] D. Wang and M. Li, "Robust stochastic configuration networks with kernel density estimation for uncertain data regression," Information Sciences, vol. 412–413, pp. 210–222, 2017.
[30] D. Wang and C. Cui, "Stochastic configuration networks ensemble with heterogeneous features for large-scale data analytics," Information Sciences, vol. 417, pp. 55–71, 2017.
[31] S. Scardapane, D. Wang, and A. Uncini, "Bayesian random vector functional-link networks for robust data modeling," IEEE Transactions on Cybernetics, doi: 10.1109/TCYB.2017.2726143, 2017.
[32] S. Scardapane, D. Wang, M. Panella, and A. Uncini, "Distributed learning for random vector functional-link networks," Information Sciences, vol. 301, pp. 271–284, 2015.
Dianhui Wang (M'03–SM'05) was awarded a Ph.D. from Northeastern University, Shenyang, China, in 1995. From 1995 to 2001, he worked as a Postdoctoral Fellow with Nanyang Technological University, Singapore, and a Researcher with The Hong Kong Polytechnic University, Hong Kong, China. He joined La Trobe University in July 2001 and is currently a Reader and Associate Professor with the Department of Computer Science and Information Technology, La Trobe University, Australia. His current research interests include data mining and computational intelligence techniques for bioinformatics and engineering applications, and randomized learning algorithms for big data analytics. Dr. Wang is a Senior Member of the IEEE, and serves as Editor-in-Chief of the International Journal of Machine Intelligence and Sensory Signal Processing, and as an Associate Editor for IEEE Transactions on Neural Networks and Learning Systems, IEEE Transactions on Cybernetics, Information Sciences, and Neurocomputing.
Ming Li was awarded a Bachelor's degree in Mathematics from Shandong Normal University, China, and a Master's degree in Applied Mathematics from China Jiliang University, in 2010 and 2013, respectively. He is currently pursuing a PhD in Computer Science with the Department of Computer Science and Information Technology, La Trobe University, Australia. His research interests include randomized learning algorithms for neural networks, robust modelling techniques and large-scale data analytics.