International Journal of Bifurcation and Chaos, Vol. 28, No. 5 (2018) 1850068 (22 pages) © World Scientific Publishing Company. DOI: 10.1142/S0218127418500682
Brain-Inspired Constructive Learning Algorithms with Evolutionally Additive Nonlinear Neurons

Le-Heng Fang
School of Mathematical Sciences, Fudan University, Shanghai 200433, P. R. China
School of Data Science, Fudan University, Shanghai 200433, P. R. China
[email protected]

Wei Lin∗
School of Mathematical Sciences, Fudan University, Shanghai 200433, P. R. China
Centre for Systems Computational Biology and Institute of Science and Technology for Brain-Inspired Intelligence, Fudan University, Shanghai 200433, P. R. China
Shanghai Key Laboratory of Contemporary Applied Mathematics and the LMNS (Fudan University), Ministry of Education, P. R. China
[email protected]

Qiang Luo
Centre for Systems Computational Biology and Institute of Science and Technology for Brain-Inspired Intelligence, Fudan University, Shanghai 200433, P. R. China
School of Life Science, Fudan University, Shanghai 200433, P. R. China
[email protected]

Received April 1, 2018

In this article, inspired partially by the physiological evidence of the brain's growth and development, we develop a new type of constructive learning algorithm with evolutionally additive nonlinear neurons. The new algorithms have a remarkable ability in effective regression and accurate classification. In particular, the algorithms are able to sustain a certain reduction of the loss function when the dynamics of the trained network are bogged down in the vicinity of local minima. The algorithm augments the neural network by adding only a few connections as well as neurons whose activation functions are nonlinear, nonmonotonic, and self-adapted to the dynamics of the loss functions. Indeed, we analytically demonstrate the reduction dynamics of the algorithm for different problems, and further modify the algorithms so as to obtain an improved generalization capability for the augmented neural networks. Finally, through comparison with the classical algorithm and architecture for neural network construction, we show that our constructive learning algorithms as well as their modified versions have better performances, such
∗ Author for correspondence
as faster training speed and smaller network size, on several representative benchmark datasets including the MNIST dataset of handwritten digits.

Keywords: Nonlinear neuron; brain-inspired neural network; constructive learning algorithm; nonmonotonic activation function; training dynamics.
1. Introduction

Artificial neural networks (ANNs), inspired by the biological knowledge of the neural system in neuroscience, have attracted wide attention from both academia and industry [Krizhevsky et al., 2012; Hinton et al., 2012a; Sutskever et al., 2014; Silver et al., 2016; Vromen & Steur, 2016; Esteva et al., 2017; Hramov et al., 2018]. From the classic feedforward network to the most recent deep networks, network design, such as network size and activation function, has always been a core topic in ANN research [Hassoun, 1995; Haykin, 2009; Du & Swamy, 2013; Hagan et al., 2014]. However, the theory for determining the key parameters of a network design for a given problem remains incomplete, and needs more knowledge and inspiration from biology and neuroscience for further development and improvement.

On one hand, most current learning algorithms for ANNs are applicable to a predefined network structure with preselected activation functions for the neurons [Hassoun, 1995; Haykin, 2009; Du & Swamy, 2013; Hagan et al., 2014]. Theoretically, an ANN even with a single hidden layer has been proved capable of approximating any measurable function or functional to any desired accuracy level [Hornik et al., 1989; Magoulas & Vrahatis, 2006; Li et al., 2015]. However, with in-depth applications to complex problems, the theory has confronted difficulties such as under- or over-fitting and poor generalization capability [Hassoun, 1995; Haykin, 2009; Du & Swamy, 2013; Hagan et al., 2014]. Although these difficulties have been largely remedied by the tremendous development of deep learning, deep networks, and big-data collection techniques [Hinton et al., 2006; LeCun et al., 2015; Goodfellow et al., 2016], empirically determining the network design can take substantial time and resources, since a tremendous number of possible combinations may have to be tried before reaching some relatively optimal design in practice [Sharma & Chandra, 2010].

To address this problem in applications, constructive neural networks (CoNNs) were proposed and systematically investigated; their network structures, inspired physiologically by the brain's growth and development, grow regularly along with the learning processes [Sharma & Chandra, 2010; Parekh et al., 2010; Ma & Khorasani, 2004; Ash, 1989; Lebiere & Fahlman, 1990; Kwok & Yeung, 1997; Friedman & Stuetzle, 1981; Platt, 1991; Farlow, 1984; Hirose et al., 1991; Rivals & Personnaz, 2003; Islam et al., 2009; Yang & Chen, 2012; Wu et al., 2015]. More specifically, the dynamic node creation (DNC) algorithm [Ash, 1989] and the cascade-correlation (CC) algorithm [Lebiere & Fahlman, 1990] are the two most well-known CoNN algorithms. In particular, the DNC algorithm builds a one-hidden-layer, feedforward ANN by adding neurons into the hidden layer, while the CC algorithm builds a multiple-hidden-layer network by adding one node into each layer. The CC algorithm is thus able to overcome the difficulties which the DNC algorithm encounters in solving complex problems [Lebiere & Fahlman, 1990]. There are also other categories of CoNN algorithms, such as the projection pursuit regression algorithm [Kwok & Yeung, 1997; Friedman & Stuetzle, 1981], the resource-allocating networks algorithm [Platt, 1991], and the group methods of data handling [Farlow, 1984]. Naturally, CoNNs are likely to become oversized in the process of adding neurons or connections to the networks. To avoid this defect, some delicate pruning algorithms have subsequently been developed [Hirose et al., 1991; Rivals & Personnaz, 2003; Islam et al., 2009; Yang & Chen, 2012; Wu et al., 2015]. However, it is still difficult in practice to manipulate the network size and to specify the exact step at which the process should be terminated so as to achieve an optimal balance between the network size and the learning accuracy.

On the other hand, neurons with specific nonmonotonic activation functions seem, at first thought, biologically unrealistic. However, nonlinear, nonmonotonic responses can emerge from the collective activity of an ensemble of real
neurons, or from an appropriate activation of excitatory and inhibitory connections [Trappenberg, 2010; Lu et al., 2010]. Inspired by this biological evidence, specific forms of nonmonotonic activation functions have been proposed for network design [Morita, 1993] and successfully applied to enhancing the storage capacity of ANNs for memory association or for solving other complex problems [Yoshizawa et al., 1996; Yanai & Amari, 1996; Adachi & Aihara, 1997; Crespi, 1999; Chen & Aihara, 2000; Nara, 2003, 2008; Lin & Chen, 2009; Ma & Wu, 2009]. The inclusion of specific nonmonotonic activation functions, such as the Hermite polynomials, in CoNNs has been realized [Ma & Khorasani, 2005], though this particular form of activation function probably results in low efficiency of network construction. This thus calls for further in-depth investigations in the direction of designing appropriate forms and connections for the additive neurons.

In this article, to make progress along the above-indicated direction and to achieve a fine balance between network size and learning accuracy, we propose novel constructive learning algorithms that add particular neurons which are nonmonotonic and adapted to the current values of the network loss functions. This augmentation and adaptation procedure is unlike the method proposed in the literature, where the selection of activation functions is based directly on the network's inputs [Yanai & Amari, 1996]. The inspiration for designing these algorithms comes partially from the biological evidence that the physiological growth and development of the brain can be enhanced particularly by external stimulation [Gallo, 2007; Kukley et al., 2007; Ziskin et al., 2007].

The remainder of this article is organized as follows. In Sec. 2, constructive learning algorithms for regression problems and classification problems are formulated, respectively. Indeed, these algorithms are analytically validated to be useful in reducing the network loss functions through adding specific nonlinear neurons. In Sec. 3, a technique of filtering the nonmonotonic activation function is proposed to promote the generalization capability of the augmented neural network, and a concrete regression example is used to validate this technique. In Sec. 4, several real-world datasets, including the Iris flower dataset, the Cleveland clinic heart disease dataset, and the MNIST dataset of handwritten digits, are used as benchmarks to illustrate the quality of our proposed algorithms on classification problems. Finally, the article closes with some concluding remarks and further perspectives.
2. The Role of Constructive Learning Algorithms with Additive Nonlinear Neurons In this section, for regression problems and classification problems, we, respectively, propose constructive learning algorithms to reduce their network loss functions through adding nonlinear neurons with particularly-designed, nonmonotonic activation functions. Indeed, we provide analytical estimations for these reductions of the corresponding problems.
2.1. Algorithms for regression problems

Suppose there are K training samples of m dimensions, denoted by $X^p = [x_1^p, x_2^p, \ldots, x_m^p]$ ($p = 1, 2, \ldots, K$), with the output dimension n. Then, the corresponding network outputs of the training samples $X^p$ are denoted by $Y^p = [y_1^p, y_2^p, \ldots, y_n^p]$, and the training labels of the training samples $X^p$ are denoted by $T^p = [t_1^p, t_2^p, \ldots, t_n^p]$. Hence, the total loss between the outputs of the network and the labels of the training set becomes
$$L = \sum_{j=1}^{n} L_j = \frac{1}{2}\sum_{j=1}^{n}\sum_{p=1}^{K}(y_j^p - t_j^p)^2.$$
Using the classical back-propagation algorithm with a given network structure, the training process of a traditional neural network is terminated once every connection weight $w$ in the network satisfies $\partial L/\partial w \approx 0$. However, in many cases, the network is likely to have fallen into a local minimum when the training is terminated. Denote the loss at this terminal time by $L = \epsilon$. In real applications, $\epsilon$ is almost surely larger than zero. Now, we hope to reduce the value of $L = \epsilon$ significantly through adding a neuron with an activation function $f(x)$ connecting either the input layer or any hidden layer with the outputs of the output layer, as schematically shown in Fig. 1. Specifically, we are to design new connection weights and a new activation function $f(x)$ in the following two situations: (1) the new neuron has a single connection with the output of one neuron in the output layer
Fig. 1. Schematic diagrams for regression problems. (a) The left panel describes the added, nonmonotonic neuron, which connects the input or any hidden layers with the output of one neuron in the output layer and (b) the right panel corresponds to the added, nonmonotonic neuron, connecting the input or any hidden layers with the outputs of multiple neurons in the output layer.
[see Fig. 1(a)], and (2) the new neuron has multiple connections with the outputs of some neurons in the output layer, including the fully-connected case [see Fig. 1(b)]. As for the connections of the hidden layers with the outputs of the output layers, analogous configurations can be established and the reduced amount can be similarly estimated.
2.1.1. Single connection situation

Theorem 1. Set $l := \arg\max_{j\in\{1,2,\ldots,n\}} L_j$, in which $L_j := \frac{1}{2}\sum_{p=1}^{K}(y_j^p - t_j^p)^2$. Then, appropriately adding a new neuron connected with the output of the $l$th output neuron can reduce the total loss from $L$ to $\hat L$. Correspondingly, the reduced amount $|\Delta L| = |L - \hat L|$ becomes at least
$$L_l = \frac{1}{2}\sum_{p=1}^{K}(y_l^p - t_l^p)^2 \ge \frac{L}{n}.$$

Proof. With the configurations set in the theorem, the outputs of the network become
$$\hat y_j^p = \begin{cases} y_j^p + w_l\, f\Big(\sum_{i=1}^{m}\bar w_i x_i^p\Big), & j = l,\\[2pt] y_j^p, & j \ne l, \end{cases} \tag{1}$$
and then the total loss pending for reduction becomes
$$\hat L = \sum_{j=1}^{n}\hat L_j = \frac{1}{2}\sum_{j=1}^{n}\sum_{p=1}^{K}(\hat y_j^p - t_j^p)^2.$$
Each $\bar w_i$ in Eq. (1) represents the connection weight between the newly-added neuron and the input neuron $x_i$, and $w_l$ in Eq. (1) stands for the connection weight between the added neuron and the output $y_l$. If $w_l$ equals 0, $L$ remains unchanged, i.e. $\hat L = L$, while the partial derivative of $\hat L$ with respect to $w_l$ is certainly nonzero. Therefore, we can continue to use the standard gradient descent method to reduce the loss function of this new network. To validate this, we first perform the following calculations:
$$\frac{\partial \hat L}{\partial w_l} = \frac{\partial \hat L_l}{\partial w_l} = \sum_{p=1}^{K}\frac{\partial \hat L_l}{\partial \hat y_l^p}\frac{\partial \hat y_l^p}{\partial w_l} = \sum_{p=1}^{K}(\hat y_l^p - t_l^p)\, f\Big(\sum_{i=1}^{m}\bar w_i x_i^p\Big) = w_l\sum_{p=1}^{K} f^2\Big(\sum_{i=1}^{m}\bar w_i x_i^p\Big) + \sum_{p=1}^{K}(y_l^p - t_l^p)\, f\Big(\sum_{i=1}^{m}\bar w_i x_i^p\Big). \tag{2}$$

We need to design the weights $\bar w_i$ ($i = 1, 2, \ldots, m$) connecting the added neuron with the input neurons, and also design the activation function $f(x)$, so as to make the absolute value of $\Delta L$ as large as possible. Without loss of generality, we assume that $X^a \ne X^b$ for any two training samples $X^a, X^b$. Then, the differences $X^a - X^b$ ($a \ne b$ and $a, b \in \{1, 2, \ldots, K\}$) together form $C_K^2$ nonzero $m$-dimensional vectors, where $C_s^t$ represents the number of $t$-combinations of a set containing $s$ elements. Hence, we select an $m$-dimensional vector, denoted by $\bar w$, such that $\bar w$ is not orthogonal to any of these vectors, i.e. $\langle \bar w, X^a - X^b\rangle \ne 0$ for $a \ne b$ and $a, b \in \{1, 2, \ldots, K\}$. Here, $\langle\cdot,\cdot\rangle$ stands for the inner product of two vectors. We write the vector $\bar w$ component-wise as $\bar w = [\bar w_1, \bar w_2, \ldots, \bar w_m]$. Then, for any $a \ne b$ with $a, b \in \{1, 2, \ldots, K\}$, we have
$$\sum_{i=1}^{m}\bar w_i x_i^a \ne \sum_{i=1}^{m}\bar w_i x_i^b.$$
Hence, we construct the activation function $f(x)$ by defining its values at the $K$ different points $\langle\bar w, X^p\rangle = \sum_{i=1}^{m}\bar w_i x_i^p$ ($p = 1, 2, \ldots, K$) as
$$f\Big(\sum_{i=1}^{m}\bar w_i x_i^p\Big) := y_l^p - t_l^p, \quad p = 1, 2, \ldots, K. \tag{3}$$
The values of the function $f(x)$ at any other points in the domain are defined by the piecewise linear functions connecting the values of the function at the above $K$ different points. Clearly, if $L$ is nonzero, $\sum_{p=1}^{K} f^2(\sum_{i=1}^{m}\bar w_i x_i^p)$ is nonzero. Accordingly, it is possible to change $w_l$ from 0 to $\Delta w_l$ with
$$\Delta w_l := -\frac{\partial \hat L}{\partial w_l}\bigg|_{w_l=0}\left[\sum_{p=1}^{K} f^2\Big(\sum_{i=1}^{m}\bar w_i x_i^p\Big)\right]^{-1} = -1.$$
Thus, we obtain that the total loss varies in the following manner:
$$\Delta L = \Delta L_l = \hat L_l - L_l = \int_0^{\Delta w_l}\frac{\partial \hat L_l}{\partial w_l}\,dw_l = \frac{1}{2}\sum_{p=1}^{K} f^2\Big(\sum_{i=1}^{m}\bar w_i x_i^p\Big) w_l^2\bigg|_0^{\Delta w_l} + \sum_{p=1}^{K}(y_l^p - t_l^p)\, f\Big(\sum_{i=1}^{m}\bar w_i x_i^p\Big) w_l\bigg|_0^{\Delta w_l} = -\frac{1}{2}\sum_{p=1}^{K}(y_l^p - t_l^p)^2,$$
where the calculation uses the result obtained in Eq. (2) and the function values defined in Eq. (3). Consequently, we have
$$|\Delta L| \ge \frac{1}{2n}\sum_{j=1}^{n}\sum_{p=1}^{K}(y_j^p - t_j^p)^2 = \frac{1}{n}\sum_{j=1}^{n} L_j = \frac{L}{n},$$
which completes the proof.
Theorem 1 implies that, by adding a specific neuron, the total loss L can be reduced by a certain amount. Afterwards, we use appropriate methods, such as the gradient descent method, for further training of the network. This training proceeds until some terminal criterion is fulfilled, for instance, until the partial derivatives of the total loss with respect to every weight in the new network all become satisfactorily small; if the loss itself is still unsatisfactorily large at that point, another nonmonotonic neuron is introduced.
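To make the construction in the proof of Theorem 1 concrete, the following sketch (Python/NumPy; all function and variable names are ours, not taken from the paper) builds the piecewise-linear activation of Eq. (3) from the current residuals, picks a random projection vector as $\bar w$, sets the output weight to $\Delta w_l = -1$, and checks the resulting drop of the loss on the $l$th output. It is a minimal illustration of the single-connection step, not the authors' implementation.

```python
import numpy as np

def add_nonmonotonic_neuron(X, Y, T):
    """One constructive step of Sec. 2.1.1 (single connection, regression).

    X: (K, m) training inputs, Y: (K, n) current network outputs,
    T: (K, n) training labels.  Returns the chosen output l, the projection
    vector w_bar, the activation f, the initial weight w_l = -1, and the
    loss L_l on output l before/after the augmentation.
    """
    n = Y.shape[1]
    residual_sq = 0.5 * ((Y - T) ** 2).sum(axis=0)      # L_j for each output j
    l = int(np.argmax(residual_sq))                     # pick the worst output
    rng = np.random.default_rng(0)
    w_bar = rng.normal(size=X.shape[1])                 # a.s. separates the K samples
    h = X @ w_bar                                       # the K knots <w_bar, X^p>
    f_vals = Y[:, l] - T[:, l]                          # Eq. (3): f(h_p) := y_l^p - t_l^p
    order = np.argsort(h)
    h, f_vals = h[order], f_vals[order]

    def f(x):                                           # piecewise linear, zero outside [h_1, h_K]
        return np.interp(x, h, f_vals, left=0.0, right=0.0)

    w_l = -1.0                                          # Delta w_l in the proof
    y_l_new = Y[:, l] + w_l * f(X @ w_bar)              # Eq. (1) on output l
    L_l_before = residual_sq[l]
    L_l_after = 0.5 * ((y_l_new - T[:, l]) ** 2).sum()
    return l, w_bar, f, w_l, L_l_before, L_l_after

# Toy check: the drop equals L_l, so the loss on output l vanishes here.
X = np.random.default_rng(1).normal(size=(8, 3))
T = np.random.default_rng(2).normal(size=(8, 2))
Y = np.zeros_like(T)                                    # a (bad) current network output
print(add_nonmonotonic_neuron(X, Y, T)[-2:])            # (L_l, ~0.0)
```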
2.1.2. Multiconnection situation

As discussed in the preceding situation, the newly-added neuron is connected with the output $y_l$ of the
$l$th output neuron. We then train the augmented network by any appropriate method until some terminal criterion is satisfied. Accordingly, the network output of each training sample $X^p$ becomes $Y^{(1,p)} = [y_1^{(1,p)}, y_2^{(1,p)}, \ldots, y_n^{(1,p)}]$, and the network loss function changes to $L^{(1)}$. Now, in addition to the existing connection, we create the second connection of this newly-added neuron with the output of the $l^{(1)}$th output neuron. Here,
$$l^{(1)} := \arg\max_{j\in\{1,2,\ldots,n\}\setminus\{l\}} F_j^{(1)}$$
and
$$F_j^{(1)} := \frac{1}{2}\sum_{p=1}^{K}\big[y_j^{(1,p)} - t_j^p\big]\, f\Big(\sum_{i=1}^{m}\bar w_i x_i^p\Big) = \frac{1}{2}\sum_{p=1}^{K}\big[y_j^{(1,p)} - t_j^p\big](y_l^p - t_l^p),$$
in which the second equality follows from the values defined in Eq. (3). Then, we train the network until $L^{(1)}$ converges. Recursively, we create the $k$th connection of this newly-added neuron with the output $y_{l^{(k-1)}}^{(k-1)}$ of the $l^{(k-1)}$th output neuron, as intuitively shown in Fig. 1(b). Here,
$$l^{(k-1)} := \arg\max_{j\in\{1,2,\ldots,n\}\setminus\{l, l^{(1)},\ldots, l^{(k-2)}\}} F_j^{(k-1)}$$
and
$$F_j^{(k-1)} := \frac{1}{2}\sum_{p=1}^{K}\big[y_j^{(k-1,p)} - t_j^p\big](y_l^p - t_l^p).$$
Also, we initialize the $k$th connection weight $w_l^{(k-1)}$ as
$$\Delta w_l^{(k-1)} := -\frac{\partial \hat L^{(k-1)}}{\partial w_l^{(k-1)}}\bigg|_{w_l^{(k-1)}=0}\left[\sum_{p=1}^{K} f^2\Big(\sum_{i=1}^{m}\bar w_i x_i^p\Big)\right]^{-1} = -\frac{\sum_{p=1}^{K}\big(y_{l^{(k-1)}}^{(k-1,p)} - t_{l^{(k-1)}}^p\big)(y_l^p - t_l^p)}{\sum_{p=1}^{K}(y_l^p - t_l^p)^2}.$$
Thus, the loss changes as
$$\Delta L^{(k-1)} = \hat L^{(k-1)} - L^{(k-1)} = -\frac{\Big[\sum_{p=1}^{K}\big(y_{l^{(k-1)}}^{(k-1,p)} - t_{l^{(k-1)}}^p\big)\, f\big(\sum_{i=1}^{m}\bar w_i x_i^p\big)\Big]^2}{2\sum_{p=1}^{K} f^2\big(\sum_{i=1}^{m}\bar w_i x_i^p\big)} = -\frac{\Big[\sum_{p=1}^{K}\big(y_{l^{(k-1)}}^{(k-1,p)} - t_{l^{(k-1)}}^p\big)(y_l^p - t_l^p)\Big]^2}{2\sum_{p=1}^{K}(y_l^p - t_l^p)^2},$$
which, by virtue of the Cauchy–Schwarz inequality, further yields
$$\sup|\Delta L^{(k-1)}| = \frac{1}{2}\sum_{p=1}^{K}\big(y_{l^{(k-1)}}^{(k-1,p)} - t_{l^{(k-1)}}^p\big)^2 = L_{l^{(k-1)}}^{(k-1)}.$$
Here, $|\Delta L^{(k-1)}|$ approaches this supremum if and only if
$$\big(y_{l^{(k-1)}}^{(k-1,p_1)} - t_{l^{(k-1)}}^{p_1}\big)(y_l^{p_2} - t_l^{p_2}) = \big(y_{l^{(k-1)}}^{(k-1,p_2)} - t_{l^{(k-1)}}^{p_2}\big)(y_l^{p_1} - t_l^{p_1}) \tag{4}$$
for any $p_1, p_2 \in \{1, 2, \ldots, K\}$. Therefore, after the creation of the $k$th connection, the network loss changes to $L^{(k)}$ with $L^{(k)} \le L^{(k-1)} + \Delta L^{(k-1)}$. This reveals that the network loss function can be persistently reduced.

All we have presented above constitutes the constructive learning algorithms that allow us to add one specific neuron to the neural network to reduce the loss function for regression problems. Next, we show how to design effective constructive learning algorithms for classification problems.
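Continuing the regression sketch given after Theorem 1, the choice of the next output to which the same neuron is wired and the initialization of the new weight can be written down directly from $F_j^{(k-1)}$ and $\Delta w_l^{(k-1)}$ above. The helper below is our own shorthand (not the authors' code); `f_vals_l` denotes the fixed values $f(\langle\bar w, X^p\rangle) = y_l^p - t_l^p$ recorded when the neuron was created.

```python
import numpy as np

def next_connection(Y_curr, T, f_vals_l, connected):
    """Pick the output l^(k-1) for the kth connection and its initial weight.

    Y_curr: (K, n) outputs after the previous training round, T: (K, n) labels,
    f_vals_l: (K,) values of the neuron's activation on the training samples,
    connected: set of output indices already wired to this neuron.
    """
    n = Y_curr.shape[1]
    F = 0.5 * ((Y_curr - T) * f_vals_l[:, None]).sum(axis=0)   # F_j^{(k-1)}
    candidates = [j for j in range(n) if j not in connected]
    l_next = max(candidates, key=lambda j: F[j])
    # Initial weight: negative gradient step normalized by sum of f^2.
    delta_w = -((Y_curr[:, l_next] - T[:, l_next]) * f_vals_l).sum() / (f_vals_l ** 2).sum()
    return l_next, delta_w
```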
2.2. Algorithms for classification problems

To solve classification problems in real applications, we usually train neural networks by minimizing the
Fig. 2. Schematic diagrams for classification problems. (a) The left panel describes the added, nonmonotonic neuron connecting the input or hidden layers with one of the inputs in the output softmax layer and (b) the right panel corresponds to the added neuron connecting the input or hidden layers with multiple inputs in the output softmax layer.
cross-entropy. When a trained network falls into some local minimum with a very limited convergence rate, we still attempt to expedite the convergence by adding a neuron with a specifically-designed activation function $f(x)$ connecting either the input layer or any hidden layers with single or multiple inputs of the output softmax layer in the network (see Fig. 2). To this end, we suppose there are K training samples, denoted by $X^p$ ($p = 1, 2, \ldots, K$), with the input dimension m and the total category number n. The corresponding network output of the training sample $X^p$ is $Y^p$, and the training label of $X^p$ is $T^p = [t_1^p, t_2^p, \ldots, t_n^p]$, where $t_j^p = 1$ if $X^p$ belongs to the $j$th category, and otherwise $t_j^p = 0$. Then, the maximum-likelihood loss of the network is of the form
$$L = -\sum_{j=1}^{n}\sum_{p=1}^{K} t_j^p \ln y_j^p,$$
where $y_j^p = e^{a_j^p}\big(\sum_{i=1}^{n} e^{a_i^p}\big)^{-1}$ and $a_j^p$ ($j = 1, 2, \ldots, n$) are, respectively, the outputs and inputs of the output softmax layer for the training sample $X^p$. Akin to the discussions performed in the last subsection, we divide the following discussions into two situations, viz. single connection and multiconnection.
2.2.1. Single connection situation

Theorem 2. We set
$$l := \arg\max_{j\in\{1,2,\ldots,n\}}\sum_{p=1}^{K}(y_j^p - t_j^p)^2.$$
Then, appropriately adding a new neuron connected with the $l$th input of the output softmax layer can reduce the total loss by an amount of $|\Delta L| = 2\sum_{p=1}^{K}(y_l^p - t_l^p)^2$.

Proof. With the configurations assumed in the theorem, the inputs of the output softmax layer can be expressed as
$$\hat a_j^p = \begin{cases} a_j^p + w_l\, f\Big(\sum_{i=1}^{m}\bar w_i x_i^p\Big), & j = l,\\[2pt] a_j^p, & j \ne l. \end{cases} \tag{5}$$
Correspondingly, the outputs become $\hat y_j^p = e^{\hat a_j^p}\big(\sum_{i=1}^{n} e^{\hat a_i^p}\big)^{-1}$ for $j = 1, 2, \ldots, n$, and the loss changes to
$$\hat L = -\sum_{j=1}^{n}\sum_{p=1}^{K} t_j^p \ln\hat y_j^p.$$
Here, each $\bar w_i$ in Eq. (5) is the connection weight between the newly-added neuron and the input neuron $x_i$.¹ The weight $w_l$ is initialized to zero, which yields $\hat L = L$. Next, we need to reduce $\hat L$ as much as possible. To this end, the derivative of $\hat L$ with respect to $w_l$ gives
$$\frac{\partial\hat L}{\partial w_l} = \sum_{p=1}^{K}\frac{\partial\hat L}{\partial\hat a_l^p}\frac{\partial\hat a_l^p}{\partial w_l} = \sum_{p=1}^{K}\Bigg[\sum_{j=1}^{n}\frac{\partial\hat L}{\partial\hat y_j^p}\frac{\partial\hat y_j^p}{\partial\hat a_l^p}\Bigg] f\Big(\sum_{i=1}^{m}\bar w_i x_i^p\Big) = \sum_{p=1}^{K}\Bigg[-\sum_{j\ne l}\frac{t_j^p}{\hat y_j^p}\frac{\partial\hat y_j^p}{\partial\hat a_l^p} - \frac{t_l^p}{\hat y_l^p}\frac{\partial\hat y_l^p}{\partial\hat a_l^p}\Bigg] f\Big(\sum_{i=1}^{m}\bar w_i x_i^p\Big)$$
$$= \sum_{p=1}^{K}\Bigg[\sum_{j\ne l}\frac{t_j^p}{\hat y_j^p}\frac{e^{\hat a_j^p} e^{\hat a_l^p}}{\big(\sum_{i=1}^{n}e^{\hat a_i^p}\big)^2} - \frac{t_l^p}{\hat y_l^p}\frac{e^{\hat a_l^p}\sum_{i=1}^{n}e^{\hat a_i^p} - e^{\hat a_l^p} e^{\hat a_l^p}}{\big(\sum_{i=1}^{n}e^{\hat a_i^p}\big)^2}\Bigg] f\Big(\sum_{i=1}^{m}\bar w_i x_i^p\Big) = \sum_{p=1}^{K}(\hat y_l^p - t_l^p)\, f\Big(\sum_{i=1}^{m}\bar w_i x_i^p\Big), \tag{6}$$
where the last expression is the same as that in the first line of Eq. (2) for the regression problems. A further derivative of Eq. (6) with respect to $w_l$ yields
$$\frac{\partial^2\hat L}{\partial w_l^2} = \sum_{p=1}^{K}\frac{\partial\hat y_l^p}{\partial w_l}\, f\Big(\sum_{i=1}^{m}\bar w_i x_i^p\Big) = \sum_{p=1}^{K}\frac{e^{\hat a_l^p}\sum_{i\ne l}e^{\hat a_i^p}}{\big(\sum_{i=1}^{n}e^{\hat a_i^p}\big)^2}\, f^2\Big(\sum_{i=1}^{m}\bar w_i x_i^p\Big) > 0, \tag{7}$$
which indicates that $\frac{\partial\hat L}{\partial w_l}$ is a monotonically increasing function of $w_l$. Notice that
$$\lim_{w_l\to-\infty \text{ or } w_l\to+\infty}\hat y_l^p = \lim_{w_l\to-\infty \text{ or } w_l\to+\infty}\frac{e^{a_l^p + w_l f(\sum_{i=1}^{m}\bar w_i x_i^p)}}{e^{a_l^p + w_l f(\sum_{i=1}^{m}\bar w_i x_i^p)} + \sum_{i\ne l} e^{a_i^p}} = \begin{cases} 0\ (\text{or } 1), & \text{if } f\big(\sum_{i=1}^{m}\bar w_i x_i^p\big) > 0,\\[2pt] 1\ (\text{or } 0), & \text{if } f\big(\sum_{i=1}^{m}\bar w_i x_i^p\big) < 0. \end{cases}$$
These, together with Eq. (6) and $t_l^p\in\{0,1\}$ for all $p$, give
$$\lim_{w_l\to-\infty}\frac{\partial\hat L}{\partial w_l} = \lim_{w_l\to-\infty}\sum_{p=1}^{K}(\hat y_l^p - t_l^p)\, f\Big(\sum_{i=1}^{m}\bar w_i x_i^p\Big) \le 0 \quad\text{and}\quad \lim_{w_l\to+\infty}\frac{\partial\hat L}{\partial w_l} \ge 0, \tag{8}$$
where the two limits equal zero simultaneously if and only if $f(\sum_{i=1}^{m}\bar w_i x_i^p) = 0$ for all $p = 1, 2, \ldots, K$.

¹ The case where the newly-added neuron is connected with some hidden neuron $h_i$ through the weight $\bar w_i$ can be discussed analogously.

Now, we are to construct the activation function $f(x)$ for the newly-added neuron. To this end, we select a vector $\bar w = (\bar w_1, \bar w_2, \ldots, \bar w_m)$ such that $\langle\bar w, X^a - X^b\rangle \ne 0$ for $a \ne b$ and $a, b\in\{1, 2, \ldots, K\}$. Then, we define $f(x)$ at the $K$ different points $\langle\bar w, X^p\rangle$ ($p = 1, 2, \ldots, K$) in the same manner as in Eq. (3). Similarly, we define $f(x)$ at any other points in the domain through the piecewise linear functions connecting the values at the above $K$ points. With this configuration, we conclude that if $L$ is nonzero, the values $f(\sum_{i=1}^{m}\bar w_i x_i^p)$ are not all zero. Additionally, from Eq. (6) and the definition of $f(x)$, it
follows that
$$\frac{\partial\hat L}{\partial w_l}\bigg|_{w_l=0} = \sum_{p=1}^{K}\big[\hat y_l^p|_{w_l=0} - t_l^p\big]\, f\Big(\sum_{i=1}^{m}\bar w_i x_i^p\Big) = \sum_{p=1}^{K}(y_l^p - t_l^p)^2 > 0.$$
This, together with (7) and (8), ensures that $\frac{\partial\hat L}{\partial w_l}$ has a unique and negative zero point.² Denote this zero point by $w_0$, so that $w_0 < 0$. Then, we have
$$\Delta L = \hat L - L = -\int_{w_0}^{0}\frac{\partial\hat L}{\partial w_l}\,dw_l = -\int_{w_0}^{0}\left[\frac{\partial\hat L}{\partial w_l}\bigg|_{w_l=0} - \int_{w_l}^{0}\frac{\partial^2\hat L(w)}{\partial w^2}\,dw\right]dw_l. \tag{9}$$

² If $\lim_{w_l\to-\infty}\frac{\partial\hat L}{\partial w_l} = 0$, then $-\infty$ is regarded as the zero point.

Moreover, using the inequality
$$\Big(\sum_{i=1}^{n} e^{\hat a_i^p}\Big)^2 = \Big(e^{\hat a_l^p} + \sum_{i\ne l} e^{\hat a_i^p}\Big)^2 \ge 4\, e^{\hat a_l^p}\sum_{i\ne l} e^{\hat a_i^p}, \quad p = 1, 2, \ldots, K,$$
gives the following estimation for Eq. (7):
$$\frac{\partial^2\hat L}{\partial w_l^2} = \sum_{p=1}^{K}\frac{e^{\hat a_l^p}\sum_{i\ne l}e^{\hat a_i^p}}{\big(\sum_{i=1}^{n}e^{\hat a_i^p}\big)^2}\, f^2\Big(\sum_{i=1}^{m}\bar w_i x_i^p\Big) \le \frac{1}{4}\sum_{p=1}^{K} f^2\Big(\sum_{i=1}^{m}\bar w_i x_i^p\Big) = \frac{1}{4}\sum_{p=1}^{K}(y_l^p - t_l^p)^2. \tag{10}$$
Here, we set
$$\Delta w_l := -4\,\frac{\partial\hat L}{\partial w_l}\bigg|_{w_l=0}\left[\sum_{p=1}^{K} f^2\Big(\sum_{i=1}^{m}\bar w_i x_i^p\Big)\right]^{-1} = -4 < 0. \tag{11}$$
From the estimation (10), it follows that
$$\frac{\partial\hat L}{\partial w_l}\bigg|_{w_l=\Delta w_l} = \frac{\partial\hat L}{\partial w_l}\bigg|_{w_l=0} + \int_{0}^{\Delta w_l}\frac{\partial^2\hat L}{\partial w_l^2}\,dw_l \ge \sum_{p=1}^{K}(y_l^p - t_l^p)^2 - 4\cdot\frac{1}{4}\sum_{p=1}^{K}(y_l^p - t_l^p)^2 = 0,$$
as shown in Fig. 3. Therefore, $\Delta w_l \ge w_0$, with which the reduced amount in Eq. (9) can be further estimated as
$$|\Delta L| = \int_{w_0}^{0}\left[\frac{\partial\hat L}{\partial w_l}\bigg|_{w_l=0} - \int_{w_l}^{0}\frac{\partial^2\hat L(w)}{\partial w^2}\,dw\right]dw_l \ge \int_{\Delta w_l}^{0}\left[\frac{\partial\hat L}{\partial w_l}\bigg|_{w_l=0} - \frac{1}{4}\sum_{p=1}^{K} f^2\Big(\sum_{i=1}^{m}\bar w_i x_i^p\Big)|w_l|\right]dw_l = \frac{\partial\hat L}{\partial w_l}\bigg|_{w_l=0}|\Delta w_l| - \frac{1}{4}\cdot\frac{1}{2}\sum_{p=1}^{K} f^2\Big(\sum_{i=1}^{m}\bar w_i x_i^p\Big)\Delta w_l^2 = 2\sum_{p=1}^{K}(y_l^p - t_l^p)^2, \tag{12}$$
where the second inequality uses the estimation obtained in (10). Therefore, the final estimation completes the proof.

Fig. 3. A sketch of the function $\frac{\partial\hat L}{\partial w_l}$ with respect to the argument $w_l$, where $\Delta w_l = -4$ is no less than $w_0$. Here, $w_0$ is the unique zero point of this function.
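As with the regression case, the constructive step of Theorem 2 can be illustrated numerically; here the new neuron feeds the $l$th input of the softmax layer and the output weight is initialized to $\Delta w_l = -4$. The sketch below (our own minimal NumPy illustration, not the authors' code) takes the current softmax inputs $a_j^p$ as given.

```python
import numpy as np

def softmax(a):
    e = np.exp(a - a.max(axis=1, keepdims=True))        # numerically stable softmax
    return e / e.sum(axis=1, keepdims=True)

def add_neuron_classification(X, A, T):
    """One constructive step of Sec. 2.2.1: A is the (K, n) matrix of softmax
    inputs a_j^p produced by the current network, T the one-hot labels."""
    Y = softmax(A)
    l = int(np.argmax(((Y - T) ** 2).sum(axis=0)))       # worst class column
    rng = np.random.default_rng(0)
    w_bar = rng.normal(size=X.shape[1])
    h = X @ w_bar
    f_vals = Y[:, l] - T[:, l]                           # same construction as Eq. (3)
    order = np.argsort(h)

    def f(x):                                            # piecewise linear, zero outside [h_1, h_K]
        return np.interp(x, h[order], f_vals[order], left=0.0, right=0.0)

    w_l = -4.0                                           # Eq. (11): Delta w_l = -4
    A_new = A.copy()
    A_new[:, l] += w_l * f(X @ w_bar)                    # Eq. (5)
    loss_before = -(T * np.log(Y)).sum()
    loss_after = -(T * np.log(softmax(A_new))).sum()
    return l, loss_before, loss_after                    # Theorem 2: drop >= 2*sum (y_l - t_l)^2
```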
Theorem 3. If the total loss $L > 0$, the reduced amount can be estimated as $|\Delta L| < L$.

Proof. Due to the configurations of the classification problems, we set $t_{l_p}^p = 1$ with $l_p\in\{1,2,\ldots,n\}$ and $p = 1, 2, \ldots, K$. Hence, we have
$$L = -\sum_{p=1}^{K}\sum_{j=1}^{n} t_j^p \ln y_j^p = -\sum_{p=1}^{K} t_{l_p}^p \ln y_{l_p}^p = -\sum_{p=1}^{K}\ln y_{l_p}^p$$
and
$$|\Delta L| = 2\sum_{p:\, l\ne l_p}(y_l^p)^2 + 2\sum_{p:\, l = l_p}(y_l^p - 1)^2 \le 2\sum_{p:\, l\ne l_p}(1 - y_{l_p}^p)^2 + 2\sum_{p:\, l = l_p}(y_{l_p}^p - 1)^2 = 2\sum_{p=1}^{K}(y_{l_p}^p - 1)^2. \tag{13}$$
Notice that the derivative of the function $g(x) = 2(x-1)^2 + \ln x$ with respect to $x\in(0,1]$ yields
$$g'(x) = 4x + \frac{1}{x} - 4 \ge 2\sqrt{4x\cdot\frac{1}{x}} - 4 = 0.$$
This implies that $g(x) \le g(1) = 0$ for $x\in(0,1]$. Therefore, $|\Delta L|$ in Eq. (13) can be further estimated as
$$|\Delta L| \le 2\sum_{p=1}^{K}(y_{l_p}^p - 1)^2 \le -\sum_{p=1}^{K}\ln y_{l_p}^p = L,$$
where the equality is valid if and only if $y_{l_p}^p = 1$ for all $p = 1, 2, \ldots, K$. However, this equality situation results in $L = 0$. Therefore, we have $|\Delta L| < L$ provided that $L > 0$.

Remark 2.1. Theorems 2 and 3 established above reveal that adding a neuron to the network can reduce the total loss L by a certain amount but cannot reduce it completely if it is strictly larger than zero. Also, the reduced amount could be somewhat optimally extended if the third-order derivative of $\hat L$ with respect to $w_l$ is further computed based on the second-order derivative computed in (10).

2.2.2. Multiconnection situation

The discussion of the multiconnection situation for classification problems is analogous to that for regression problems. Specifically, after the creation of the $(k-1)$th connection between the newly-added neuron and the input $a_{l^{(k-2)}}^{(k-2)}$ of the $l^{(k-2)}$th output neuron in the output softmax layer, further training is performed and the network loss function changes to $L^{(k-1)}$. We continue to create the $k$th connection between the newly-added neuron and the input $a_{l^{(k-1)}}^{(k-1)}$ of the $l^{(k-1)}$th output neuron. Here,
$$l^{(k-1)} = \arg\max_{j\in\{1,2,\ldots,n\}\setminus\{l, l^{(1)},\ldots, l^{(k-2)}\}} H_j^{(k-1)},$$
where
$$H_j^{(k-1)} := \sum_{p=1}^{K}\big(y_j^{(k-1,p)} - t_j^p\big)(y_l^p - t_l^p).$$
Now, we initialize the $k$th connection weight as
$$\Delta w_l^{(k-1)} = -4\,\frac{\partial\hat L^{(k-1)}}{\partial w_l^{(k-1)}}\bigg|_{w_l^{(k-1)}=0}\times\left[\sum_{p=1}^{K} f^2\Big(\sum_{i=1}^{m}\bar w_i x_i^p\Big)\right]^{-1} = -4\,\frac{\sum_{p=1}^{K}\big(y_{l^{(k-1)}}^{(k-1,p)} - t_{l^{(k-1)}}^p\big)(y_l^p - t_l^p)}{\sum_{p=1}^{K}(y_l^p - t_l^p)^2}.$$
Hence, using the estimation akin to the penultimate equation in (12) gives
$$|\Delta L^{(k-1)}| = |\hat L^{(k-1)} - L^{(k-1)}| \ge \frac{\Big[\sum_{p=1}^{K}\big(y_{l^{(k-1)}}^{(k-1,p)} - t_{l^{(k-1)}}^p\big)\, f\big(\sum_{i=1}^{m}\bar w_i x_i^p\big)\Big]^2}{\frac{1}{2}\sum_{p=1}^{K} f^2\big(\sum_{i=1}^{m}\bar w_i x_i^p\big)} = 2\,\frac{\Big[\sum_{p=1}^{K}\big(y_{l^{(k-1)}}^{(k-1,p)} - t_{l^{(k-1)}}^p\big)(y_l^p - t_l^p)\Big]^2}{\sum_{p=1}^{K}(y_l^p - t_l^p)^2}.$$
Here, the lower bound for $|\Delta L^{(k-1)}|$ approaches its supremum $2\sum_{p=1}^{K}\big(y_{l^{(k-1)}}^{(k-1,p)} - t_{l^{(k-1)}}^p\big)^2$ if and only if the condition, which is the same as Eq. (4), is satisfied. Consequently, as long as
$$\sum_{p=1}^{K}\big(y_{l^{(k-1)}}^{(k-1,p)} - t_{l^{(k-1)}}^p\big)(y_l^p - t_l^p) \ne 0,$$
we can always create the $k$th connection between the newly-added neuron and the input $a_{l^{(k-1)}}^{(k-1)}$ of the $l^{(k-1)}$th output neuron in the output softmax layer so as to ensure a reduction of the network loss function.

3. Filtering Activation Functions: Promotion of Generalization Capability

As proposed in the preceding section, the activation function $f(x)$ for the newly-added neuron is defined as in Eq. (3), where the residuals $y_l^p - t_l^p$ of the K training samples are used. This kind of construction makes the domain of $f(x)$ contain up to K + 1 nonmonotonic intervals, which thus enables the augmented neural network to fit the training set very well even under noise perturbations. However, this kind of direct construction is likely to render the augmented neural network overfitted and weak in generalization capability. One possible way to alleviate these problems is to adjust the weight $w_l$ not directly to $-1$ for regression problems (resp., $-4$ for classification problems), as obtained in Sec. 2, but to some point in the interval $(-1, 0)$ (resp., $(-4, 0)$). However, such a construction of $f(x)$ still uses all the training samples, which thus assimilates the noise perturbations during the network training and cannot remarkably promote the generalization capability of the trained network. Therefore, an additional technique, which can be directly used to modify the construction of the activation functions, is expected to be developed.

In this section, we first use an example to illustrate the effectiveness of the constructive learning algorithm that we have proposed in Sec. 2, and then point out the problems of overfitting and weak generalization. To solve these problems effectively, we propose a technique, and validate it by coping with the same overfitting example.

Example 3.1. Consider a regression problem of the signal
$$g(x) = \sum_{j=1}^{4}(2j-1)\sin[(2j-1)x]$$
with $x\in[-2\pi, 2\pi]$, as shown by the dashed line in Fig. 4. The training samples, depicted by the dots in Fig. 4, are generated by the noise-perturbed signal $t(x) = g(x) + \varepsilon$, where the noise perturbation $\varepsilon$ obeys the Gaussian distribution $N(0, 3)$. As shown in Fig. 5, with different numbers (10 and 50) of sigmoidal neurons, the conventional neural network models easily show under-fitting phenomena after they are trained with the samples t(x). With more than 100 sigmoidal neurons, the conventional model is able to fit relatively well the
training samples t(x).

Fig. 4. The original signal g(x), which contains 368 uniformly distributed points, is used as the testing data in the following study, and the noise-perturbed signal $t(x) = g(x) + \varepsilon$, which consists of 128 points in total, is used as the training samples.

The more neurons are used, the more accurately the trained neural network fits the training samples. However, as shown in Fig. 5, this accuracy cannot be realized on the testing data, which consists of
the noise-free signal g(x). This is a very typical example of the overfitting phenomenon when the conventional neural network is taken into account.

Next, we follow the method proposed in the last section and add one nonmonotonic neuron to a neural network with sigmoidal neurons. Unlike the conventional neural network, using only five sigmoidal neurons and one nonmonotonic neuron suffices to obtain a satisfactory fitting performance on the training set [see Fig. 6(a1)]. Although the residual is significantly reduced compared to the results in Figs. 5(b1)–5(b3), the overfitting phenomenon, as shown in Fig. 6(a1), still needs further amendment. Actually, for a neural network with an output of dimension one, through adding a neuron of the specific structure, the reduced amount of the loss function becomes $\Delta L = -\frac{1}{2}\sum_{p=1}^{K}(y^p - t^p)^2 = -L$,
Fig. 5. From under-fitting to overfitting phenomena when the conventional neural network models with different numbers of sigmoidal neurons are taken into account, respectively. (a1)–(a4) Regression curves produced by the trained network models versus the original signal g(x), the testing data. (b1)–(b4) Absolute values of residuals between the regression curves and g(x) are always fluctuating in relatively broad ranges. Here, each horizontal dash line with a decimal indicates the average absolute value of the corresponding residual.
Fig. 6. Fitting performances of the neural network models using the constructive learning algorithm with only five sigmoidal neurons and one nonmonotonic neuron. Regression curve (a1) and absolute values of residual (b1) produced by the model which is not operated by any average smooth filter. Regression curve (a2) and absolute values of residual (b2) produced by the model which is operated by an average smooth filter with a parameter λ = 0.73. Here, horizontal dash lines with decimals in (b1) and (b2) indicate the average absolute values of the corresponding residuals.
according to Theorem 1. Thus, the loss function eventually becomes zero and the overfitting problem is consequently unavoidable. Here, we propose a technique to avoid the overfitting problem by applying an average-smooth-filter with an adjustable parameter to the constructed activation function. As presented below, this enables the augmented network to learn as little redundant information from the noisy data as possible, and simultaneously leaves more room for training the network persistently.

Filtering activation functions — We arrange the K points
$$\sum_{i=1}^{m}\bar w_i x_i^p \quad (p = 1, 2, \ldots, K),$$
as mentioned in Eq. (3), in ascending order, and denote these ordered numbers by $h_k$ ($k = 1, 2, \ldots, K$). Thus, for any $k\in\{1,2,\ldots,K\}$, there exists a $p_k\in\{1,2,\ldots,K\}$ such that $h_k = \sum_{i=1}^{m}\bar w_i x_i^{p_k}$, and $h_{k_1} \le h_{k_2}$ for any $k_1 < k_2$. From the values defined in Eq. (3), it follows that the constructed activation function $f(x)$ is a piecewise linear function specified by
$$f(x) = \begin{cases} y^{p_k} - t^{p_k}, & \text{if } x = h_k,\\ 0, & \text{if } x < h_1 \text{ or } x > h_K. \end{cases}$$
Now, implementing an operation of the average-smooth-filter $F(\lambda)$ on the function $f(x)$ at the points $h_k$ and outside of $[h_1, h_K]$ yields
$$\hat f(x) := F(\lambda)\ast f = \begin{cases} \dfrac{1}{1+\lambda}\left[\dfrac{h_{k+1}-h_k}{h_{k+1}-h_{k-1}}\, f(h_{k-1}) + \dfrac{h_k-h_{k-1}}{h_{k+1}-h_{k-1}}\, f(h_{k+1}) + \lambda f(h_k)\right], & \text{if } x = h_k,\ 1 < k < K,\\[8pt] \dfrac{f(h_2) + \lambda f(h_1)}{1+\lambda}, & \text{if } x = h_1,\\[8pt] \dfrac{f(h_{K-1}) + \lambda f(h_K)}{1+\lambda}, & \text{if } x = h_K,\\[8pt] 0, & \text{if } x < h_1 \text{ or } x > h_K, \end{cases}$$
where $\lambda$ is a non-negative smoothing parameter pending selection. The whole filtered function $\hat f(x)$ can be compactly written as
$$\hat f(x) := \sum_{k=1}^{K-1}\left[\hat f(h_k) + \frac{(x - h_k)\big(\hat f(h_{k+1}) - \hat f(h_k)\big)}{h_{k+1}-h_k}\right]\max\{\min[\operatorname{sign}(x-h_k)+1, 1], 0\}\,\max\{\min[\operatorname{sign}(h_{k+1}-x), 1], 0\} + \hat f(h_K)\cdot\max\{1 - \operatorname{abs}[\operatorname{sign}(x-h_K)], 0\}.$$
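The filter itself is straightforward to transcribe into code; the sketch below (our own NumPy version, with names we chose) returns both the filtered node values $\hat f(h_k)$ and a callable piecewise-linear $\hat f$ that vanishes outside $[h_1, h_K]$.

```python
import numpy as np

def filter_activation(h, f_vals, lam):
    """Average-smooth-filter F(lambda) applied to the node values f(h_k).

    h: (K,) knots in ascending order, f_vals: (K,) values f(h_k), lam >= 0.
    """
    f_hat = np.empty_like(f_vals, dtype=float)
    for k in range(1, len(h) - 1):                       # interior knots, 1 < k < K
        w_prev = (h[k + 1] - h[k]) / (h[k + 1] - h[k - 1])
        w_next = (h[k] - h[k - 1]) / (h[k + 1] - h[k - 1])
        f_hat[k] = (w_prev * f_vals[k - 1] + w_next * f_vals[k + 1]
                    + lam * f_vals[k]) / (1.0 + lam)
    f_hat[0] = (f_vals[1] + lam * f_vals[0]) / (1.0 + lam)     # boundary k = 1
    f_hat[-1] = (f_vals[-2] + lam * f_vals[-1]) / (1.0 + lam)  # boundary k = K

    def f_hat_fn(x):                                     # piecewise linear, zero outside [h_1, h_K]
        return np.interp(x, h, f_hat, left=0.0, right=0.0)

    return f_hat, f_hat_fn

# Example: smooth the node values of an activation constructed as in Sec. 2.
h = np.array([-2.0, -0.5, 0.3, 1.7])
f_vals = np.array([0.8, -1.1, 0.6, -0.2])
f_hat_nodes, f_hat = filter_activation(h, f_vals, lam=0.73)
print(f_hat_nodes, f_hat(0.0))
```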
In order to train the augmented neural network with the above-filtered activation function $\hat f$, the derivative in either Eq. (2) or Eq. (6) should be nonzero, that is,
$$\sum_{k=1}^{K}(y^{p_k} - t^{p_k})\hat f(h_k) = \sum_{k=1}^{K} f(h_k)\hat f(h_k) \ne 0.$$
Here,
$$\sum_{k=1}^{K} f(h_k)\hat f(h_k) = \frac{1}{1+\lambda}\sum_{k=2}^{K-1} f(h_k)\left[\frac{h_{k+1}-h_k}{h_{k+1}-h_{k-1}}\, f(h_{k-1}) + \frac{h_k-h_{k-1}}{h_{k+1}-h_{k-1}}\, f(h_{k+1})\right] + \frac{\lambda}{1+\lambda}\sum_{k=1}^{K} f(h_k)^2,$$
which actually has a unique zero point, i.e. $\lambda = \lambda_0$ with
$$\lambda_0 = -\frac{\sum_{k=2}^{K-1} f(h_k)\left[\dfrac{h_{k+1}-h_k}{h_{k+1}-h_{k-1}}\, f(h_{k-1}) + \dfrac{h_k-h_{k-1}}{h_{k+1}-h_{k-1}}\, f(h_{k+1})\right]}{\sum_{k=1}^{K} f^2(h_k)}.$$
The uniqueness of the zero point implies that the derivative in either Eq. (2) or Eq. (6) is nonzero almost surely if the parameter $\lambda$ is randomly selected from any interval of finite or infinite length.

Remark 3.1. Now, we go back to Example 3.1 and use the activation function, filtered in the manner proposed above, for the additionally-added neuron. Comparing the results shown in Figs. 6(b1) and 6(b2) reveals that the absolute values of the residual between the regression curve and the original signal g(x) are outstandingly reduced when some selected values of the parameter λ are utilized.

Clearly, the selection of the parameter λ, which is not unique, is essential to how well the overfitting problem can be solved. The following discussion introduces a practical method for seeking an optimal value of λ.

Parameter optimization — To select an optimal parameter λ, we use the so-called k-fold cross-validation method [Bengio & Grandvalet, 2004; Kohavi, 1995]. More concretely, we partition the training samples randomly but equally into k subsets, denoted by $T_1, \ldots, T_k$. Then, we select a subset $T_i$ as the cross-validation set for testing the trained neural network, and use the data in the remaining subsets as the training data. After adding one neuron with the above-filtered activation function $\hat f_\lambda$ for any given non-negative parameter λ, we train the augmented neural network until some default convergence criteria are satisfied, and test the trained network on the cross-validation set $T_i$. We thus get the cross-validation loss, denoted by $L_i^{CV}(\lambda)$, for each λ. Using the method of grid search [Hsu et al., 2003] along some interval for λ and for all cross-validation losses, one can get λ's optimal value as
$$\lambda^* := \arg\min_{\lambda\in[0,\alpha],\,\lambda\ne\lambda_0}\ \sum_{i=1}^{k} L_i^{CV}(\lambda), \tag{14}$$
where α is an adjustable upper bound for λ in real applications.

Remark 3.2. We continue to investigate the data as well as the model in Example 3.1 and Remark 3.1. According to the above scheme of parameter optimization, we randomly partition the training samples t(x) into ten subsets, and seek λ's optimal value for the summation of the cross-validation losses $L_i^{CV}$ as defined above. Thus, Eq. (14) becomes $\lambda^* := \arg\min_{\lambda\in[0,\alpha],\,\lambda\ne\lambda_0}\sum_{i=1}^{10} L_i^{CV}(\lambda)$, where we set α = 2 and the grid-search step-size for λ as 0.01. As numerically depicted in Fig. 7(a), the optimal value λ∗ is approximately 0.64. To see the effectiveness of this optimal value for the promotion of the network's generalization capability, for different values of λ, we calculate the average losses, respectively, over the training samples t(x) and the testing data g(x) by
$$L_s = \frac{1}{2K_s}\sum_{p=1}^{K_s}(y^p - t^p)^2, \quad s\in\{\mathrm{train}, \mathrm{test}\},$$
where $K_{\mathrm{train}} = 128$ and $K_{\mathrm{test}} = 368$ as mentioned in Fig. 4. The variations of $L_{\mathrm{train}}$ and $L_{\mathrm{test}}$ with respect to λ are shown in Fig. 7(b). These variations clearly reveal that as the activation function of the newly-added neuron is filtered with the parameter λ taking its value in the vicinity of the optimal λ∗ [see the
shaded area in Fig. 7(b)], the augmented neural network after training has a much better performance on the testing data and a better generalization capability as well. All these illustrate the usefulness of the proposed scheme for seeking an optimal value of the filtering parameter λ.

Fig. 7. (a) The cross-validation loss $\sum_{i=1}^{k} L_i^{CV}$ approaches its minimum at λ = λ∗ ≈ 0.64, where k = 10. (b) The variations of $L_{\mathrm{train}}$ and $L_{\mathrm{test}}$ with respect to λ. Here, the best performance of the augmented neural network with the filtered activation function on the testing data is achieved at $\tilde\lambda \approx 0.78$, which is relatively close to the optimal value λ∗ ≈ 0.64. Actually, in the vicinity of λ∗ (see the shaded domain), the performance on the testing data as well as the generalization capability attains an outstanding level since the average loss $L_{\mathrm{test}}$ takes its value in a very narrow and low range, from 1.847 to 1.856.
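The grid search of Eq. (14) is a plain loop over folds and candidate λ values. The schematic below is our own sketch: `train_with_filtered_neuron` is a hypothetical stand-in for "augment the network with $\hat f_\lambda$ and train to convergence", and the cross-validation loss is taken as a squared error for concreteness.

```python
import numpy as np

def select_lambda(X, T, train_with_filtered_neuron, k=10, alpha=2.0, step=0.01):
    """k-fold grid search for the smoothing parameter lambda, following Eq. (14).

    train_with_filtered_neuron(X_tr, T_tr, lam) is assumed to return a callable
    model; the returned value is the lambda with the smallest total CV loss.
    """
    K = len(X)
    folds = np.array_split(np.random.default_rng(0).permutation(K), k)
    grid = np.arange(0.0, alpha + step / 2, step)
    cv_loss = np.zeros(len(grid))
    for g, lam in enumerate(grid):
        for i in range(k):
            val = folds[i]
            tr = np.concatenate([folds[j] for j in range(k) if j != i])
            model = train_with_filtered_neuron(X[tr], T[tr], lam)
            pred = model(X[val])
            cv_loss[g] += 0.5 * np.sum((pred - T[val]) ** 2)   # L_i^CV(lam)
    return float(grid[int(np.argmin(cv_loss))]), cv_loss

# Toy usage with a trivial stand-in "training" routine that ignores lam.
stub = lambda X_tr, T_tr, lam: (lambda X_new: np.full(len(X_new), T_tr.mean()))
X_toy = np.linspace(-1.0, 1.0, 40)
T_toy = np.sin(3.0 * X_toy)
print(select_lambda(X_toy, T_toy, stub, k=4)[0])
```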
4. Further Illustrative Examples on Classification Problems

Before introducing the illustrative examples, we propose some configurations for modifying the above-proposed constructive learning algorithms. In real applications, using either or both configurations is likely to reduce the high computational cost as much as possible.

Termination of neural network augmentation — Naturally, the procedure of adding new neurons is not endless but should be terminated by some prescribed criteria. As a matter of fact, in the following examples, we implement the terminating criterion as follows: at each step where a new neuron is added and the correspondingly augmented network is trained, we compute the loss function on the cross-validation set, which is obtained in an analogous manner to the sets $T_i$ constructed in Sec. 3 for parameter optimization. If this loss function increases compared to the one obtained in the last step, or if it becomes less than some prescribed error bound, the current network is regarded as saturated and the procedure of adding neurons is terminated.

Selecting assisting data as training samples — When a tremendous number of training samples is taken into account, the newly-added neuron, with its activation function f constructed in the manner proposed above, includes a mixture of too many monotone intervals. This certainly increases the model complexity in applications. To avoid this complexity
to some extent, instead of using the whole set of training samples $\{X^p\}_{p=1}^{K}$ to construct f, we select a batch of assisting data, denoted by B, with a given size s for constructing f [Ioffe & Szegedy, 2015]. Here, $s \ll K$, and K stands for the number of available training samples. More specifically, after training the neural network with the whole set of training samples, we calculate the residual vector E element-wise as
$$E(p) = \sum_{j=1}^{n}(y_j^p - t_j^p)^2, \quad p = 1, 2, \ldots, K,$$
and we further permute all the elements of E in descending order. We denote this permuted vector by $\hat E = [\hat E(p)]_{1\le p\le K}$, and then the assisting dataset B is selected such that, for any $X^i\in B$, the index i satisfies $E(i) \ge \hat E(s)$. Now, we use the assisting data B to construct f, operate on it with the average-smooth-filter designed in Sec. 3, and finally obtain $\hat f$. After the neural network is augmented by a neuron with the constructed and filtered $\hat f$, we can use a different batch of assisting data drawn from the whole training samples to train the newly-augmented neural network until some prescribed criteria are fulfilled.

Now, we use several representative real-world datasets as illustrative examples to validate the proposed constructive learning algorithms and their modified versions.

Example 4.1. We first consider the Iris flower dataset, in which 150 samples were taken from three Iris species (i.e. Iris setosa, Iris virginica, and Iris versicolor) and four features (i.e. the length and the width of the sepals and those of the petals) were measured for each sample in centimeters [Fisher, 1936]. We set these four features together as the four-dimensional input of a neural network model, use two hidden layers of sigmoidal neurons (five neurons in the first hidden layer and three in the second), and use a softmax layer of three neurons as the output for classification. We randomly select 30 samples from the dataset as our testing data, another 15 samples as the cross-validation set $T_1$, and the remaining 105 samples as the training samples. As listed in Table 1, the original neural network, after being trained by the gradient descent method for 500 epochs with the training samples, shows a classification accuracy of only 0.8 on the testing data. Then, we start to add nonmonotonic neurons in several stages as follows: at each stage, one neuron with a value $\lambda^*$ optimized on the cross-validation set $T_1$ is added to the network, and then the augmented network is trained on the training samples for 500 epochs. As also listed in Table 1, after only three stages of adding three nonmonotonic neurons, the classification accuracy is exceptionally promoted to 100% on the testing data. However, when ten conventional sigmoidal neurons are added to the original network, the trained network only shows a classification accuracy of 0.9667 on the same testing data. All these validate the effectiveness of our constructive learning algorithms in deviating the trained neural network from the local minima and finally promoting the classification accuracy.

Table 1. The classification accuracy for the Iris flower dataset. Here, the original neural network and the networks with different numbers of added nonmonotonic neurons are used, and the cross-validation set T1 (15 samples) and the testing data (30 samples) are randomly selected.

# of nonmonotonic neurons added    0      1      2       3      4
Accuracy on T1                     0.6    0.8    0.9     1.0    1.0
Accuracy on testing data           0.8    0.9    0.9667  1.0    1.0

Example 4.2. We consider the heart disease dataset, a processed version of the Cleveland clinic dataset, from the UCI machine learning repository [Detrano et al., 1989; Aha & Kibler, 1988]. As usual in the literature, 14 of the 76 raw attributes are taken into account, that is, age, sex, chest pain type, resting blood pressure, serum cholesterol, fasting blood sugar, resting electrocardiographic results, maximum heart rate achieved, exercise-induced angina, ST depression induced by exercise relative to rest, the slope of the peak exercise ST segment, number of major vessels (ranging from 0 to 3) colored by fluoroscopy, thal (normal, fixed defect, or reversible defect), and the diagnosis of heart disease used for prediction, ranging from 0 to 4. Here, the former 13 attributes are used as the 13-dimensional input of a neural network with two hidden layers (eight sigmoidal neurons in the first hidden layer and four in the second), and the last attribute, which corresponds to the absence (= 0) or presence (> 0) of heart disease, is represented by the output of four dimensions in the softmax layer of the network.
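For the "selecting assisting data" configuration described at the beginning of this section, a minimal sketch (ours, with made-up names) of choosing the batch B of size s by ranking the per-sample residuals reads:

```python
import numpy as np

def select_assisting_batch(Y, T, s):
    """Return the indices of the s training samples with the largest residuals
    E(p) = sum_j (y_j^p - t_j^p)^2; these form the assisting batch B used to
    construct the activation function of the next added neuron."""
    E = ((Y - T) ** 2).sum(axis=1)          # per-sample residual E(p)
    return np.argsort(E)[::-1][:s]          # indices of the s largest residuals
```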
Fig. 8. The classification performance for the Cleveland clinic heart disease dataset along the training epochs. Two types of augmented neural networks are used: (a) conventional sigmoidal neurons are added and (b) nonmonotonic and filtered neurons are added. Here, the training samples (237 samples) and the testing data (60 samples) are randomly selected.
We randomly select 60 samples from the dataset as the testing data and set the remaining 237 samples as the training samples. After we use the gradient descent method to train the original network for 500 epochs with the training samples, we adopt, respectively, two ways to augment and train the network: (i) one way is to add four conventional sigmoidal neurons to the second hidden layer and train the augmented network for another 500 epochs, and (ii) the other is to add four nonmonotonic and filtered neurons with λ = 1 to the second layer and train the augmented network for another 500 epochs. For these two ways, the dynamical behaviors of the classification accuracy on the training samples and testing data are, respectively, shown by the curves in Fig. 8. Clearly, after the same training epochs, our constructive learning algorithms outstandingly promote the classification accuracy on both training samples and testing data, which also can be seen from Table 2.

Table 2. The classification accuracy for the Cleveland clinic heart disease dataset after 1000 training epochs in Fig. 8.

Types of added neurons          Sigmoid    Nonmonotonic
Accuracy on training samples    85.23%     86.50%
Accuracy on testing data        83.33%     88.33%

Example 4.3. We consider a dataset of car evaluations, which contains 1728 samples with six attributes: buying cost (very high, high, medium, or low), maintaining cost (very high, high, medium, or low), doors (2, 3, 4, 5 or more), seats (2, 4, or more),
luggage boot (small, medium, or big), safety (low, medium, or high), and a class value (unqualified, qualified, good, or very good) [Bohanec & Rajkovic, 1988, 1990; Zupan et al., 1997]. Here, the neural network is established in an analogous manner to that in Example 4.2. The only difference lies in the configuration of the network input, whose dimension is five here. We randomly select 346 samples as the testing data and set the remaining 1382 samples as the training samples. Similar to Example 4.2, we train the network for 500 epochs with the training samples. Then, we adopt, respectively, two ways to augment and train the network in two stages. The first way consists of a first stage of adding two sigmoidal neurons to the second hidden layer and training the augmented network for 500 epochs, and a second stage of adding another two sigmoidal neurons to the second hidden layer and finally training the network for another 1500 epochs. The second way uses nonmonotonic and filtered neurons with λ = 1 in place of the added sigmoidal neurons in both stages of the first way. As shown in Fig. 9, the training procedure can be significantly shortened, since the training accuracy for the second way at epoch 1000 is approximately equal to that for the first way at epoch 2500. As also listed in Table 3, after 2500 epochs, the network augmented by the second way shows a better accuracy on both the training samples and the testing data.
Fig. 9. The classification performance for the car evaluation dataset along the training epochs. Two types of augmented neural networks are used: (a) conventional sigmoidal neurons are added in two stages and (b) nonmonotonic and filtered neurons are added in two stages as well. Here, the training samples (1382 samples) and the testing data (346 samples) are randomly selected.

Table 3. The classification accuracy for the car evaluation dataset after 2500 training epochs in Fig. 9.

Types of added neurons          Sigmoid    Nonmonotonic
Accuracy on training samples    96.91%     98.88%
Accuracy on testing data        96.32%     97.69%

Example 4.4. Finally, we consider the database of the Modified National Institute of Standards and
Technology (MNIST), which is a large database of handwritten digits including 60 000 training images and 10 000 testing images [Deng, 2012; LeCun et al., 1998]. The MNIST database is commonly and widely used for training various image processing systems [Kussul & Baidyk, 2004]. We establish a convolutional neural network with a 784-dimensional input, where the dimension equals that of a single flattened MNIST image [Lawrence et al., 1997; Krizhevsky et al., 2012]. The first convolutional layer uses a 5 × 5 image patch with 32 groups of convolutional weights to compute 32 feature maps, and the second convolutional layer uses a 5 × 5 image patch with 64 groups of convolutional weights to compute 64 feature maps. At each convolutional layer, the feature-map computation is determined by the summation of convolving a given group of weights with the contiguous field in the input image and the bias, and the obtained summation is then passed through the standard ReLU function. After each convolutional layer, there is a subsampling layer applying a 2 × 2 image patch to realize the max-pooling technique. Following the second subsampling layer, we apply a fully-connected layer, where 1024 neurons with
the weight matrix, the bias and the ReLU function are taken into account. To reduce the possible overfitting phenomenon, we employ the technique of dropout before setting the output layer [Hinton et al., 2012b]. Finally, an output softmax layer of ten outputs is utilized to make classifications. Here, we randomly select 5000 images from the training images as the cross-validation set, and set the remaining 55 000 images as the training samples. First, we train the above-established network for 20 000 epochs. At each training epoch, a batch of size 50 is utilized. At the end of 20 000 epochs, the accuracy on the testing images approaches 99.22%. Here and hereafter, the method of stochastic gradient descent [Bottou, 2004] is used in training epochs. Now, we add a new neuron connected with the inputs of the output softmax layer and the outputs of the dropout layer, and then train the augmented network for another 20 000 epochs. This procedure of adding one neuron and training the network lasts for three training rounds. Here, two types of neurons are taken into account. One type is of the conventional sigmoidal neurons, and the second type is of the nonmonotonic, filtered neurons. For each training round with the second type of neurons, we exploit a batch of assisting data B, as designed above, to construct the filtered activation function fˆ and then train the augmented network in the following epochs. As shown in Fig. 10(a), for the network which is augmented by the conventional sigmoidal neurons, the average classification accuracy on the testing
Fig. 10. The classification accuracy for the MNIST dataset of handwritten digits. Two types of augmented neural networks are used: (a) three conventional sigmoidal neurons are added consecutively at three stages and (b) three nonmonotonic and filtered neurons are added at the corresponding stages. Here, the training samples containing 55 000 images are used to train the networks for the first 20 000 epochs, and the testing samples containing 10 000 images are used to test the classification accuracy.
images in the last 20 000 epochs approaches 99.26%. However, as shown in Fig. 10(b), for the network which is augmented by the nonmonotonic and filtered neurons, the corresponding classification accuracy is exactly promoted to 99.36%. For more
detailed classification accuracy, refer to Table 4. Since the elementary architecture of the convolutional neural network per se has been validated to be practically effective in the classification problems on the huge MNIST database, the accuracy
Table 4. The detailed classification accuracy for testing the MNIST dataset by using two types of augmented neural networks. Here, the mean stands for the average accuracy over the corresponding epoch duration and the "std." for the corresponding standard deviation.

# of Neurons Added   Epoch Duration          Sigmoid Neurons           Nonmonotonic Neurons
                                             Mean (%)    Std.          Mean (%)    Std.
1                    20000–25000             99.1766     0.000610      99.2328     0.000447
                     25000–30000             99.2038     0.000656      99.2577     0.000462
                     30000–35000             99.2222     0.000521      99.2687     0.000384
                     35000–40000             99.2211     0.000589      99.2757     0.000359
                     20000–40000 (overall)   99.2059     0.000624      99.2587     0.000446
2                    40000–45000             99.2238     0.000401      99.3007     0.000358
                     45000–50000             99.2196     0.000400      99.3080     0.000431
                     50000–55000             99.2321     0.000485      99.3416     0.000368
                     55000–60000             99.2369     0.000433      99.3435     0.000330
                     40000–60000 (overall)   99.2281     0.000436      99.3235     0.000420
3                    60000–65000             99.2583     0.000440      99.3615     0.000389
                     65000–70000             99.2629     0.000413      99.3648     0.000460
                     70000–75000             99.2460     0.000374      99.3621     0.000449
                     75000–80000             99.2633     0.000334      99.3664     0.000378
                     60000–80000 (overall)   99.2576     0.000399      99.3637     0.000421
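Consistent with the caption of Table 4, the final row of each block summarizes the whole 20 000-epoch duration of the corresponding training round; for instance, the overall mean of the sigmoidal case after the first added neuron is the average of the four sub-duration means,

\[
\frac{1}{4}\,(99.1766 + 99.2038 + 99.2222 + 99.2211)\% \approx 99.2059\%.
\]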
Since the elementary architecture of the convolutional neural network per se has been validated as practically effective for the classification problems on the huge MNIST database, the accuracy promotion of 0.1%, though seemingly minor, could be regarded as an outstanding advance in network training.
5. Concluding Remarks and Perspectives

In this article, inspired partially by the physiological evidence of the brain's development, we have proposed constructive learning algorithms that evolutionarily add nonlinear neurons. The proposed algorithms have been analytically demonstrated and numerically validated on four representative benchmark datasets. These algorithms, as well as their modified versions, have several advantages for solving both regression and classification problems: (i) a certain reduction of the loss function after only one nonlinear neuron is delicately connected to the neural network, (ii) a relatively smaller network size, compared to conventional neural networks attaining the same level of accuracy, and (iii) a better generalization capability for coping with overfitting. We hope and believe that the proposed algorithms could become an efficient alternative for guiding the dynamics of trained neural networks to escape from local minima and for training small-scale networks with a tremendous amount of data.
Moreover, the algorithms developed in this article belong to supervised learning, and the networks investigated are feedforward only. However, in real applications, training data are not always abundantly accessible, and they may even depend on time evolution. Additionally, there are other neuron types and other network structures for realizing dynamic artificial intelligence. Thus, constructive algorithms for unsupervised learning [Hastie et al., 2009], reinforcement learning [Sutton & Barto, 1998], neurons with memristive or neuromorphic configurations [Hegab et al., 2015; Ren et al., 2017], and recurrent neural networks [Duane, 2017] are highly desirable and await further development. All of these, together with applications of the developed algorithms to real-world datasets, constitute the major part of our ongoing research.
Acknowledgments

This work was supported in part by the NNSF of China under Grant Nos. 11322111, 11471081 and 61773125, and by the NSF of Shanghai under Grant No. 17ZR1444400.
References

Adachi, M. & Aihara, K. [1997] "Associative dynamics in a chaotic neural network," Neur. Netw. 10, 83–98.
Aha, D. & Kibler, D. [1988] "Instance-based prediction of heart-disease presence with the Cleveland database," Tech. Rep. ICS-TR-88-07, Depart. Inform. Comp. Sci., University of California, Irvine.
Ash, T. [1989] "Dynamic node creation in backpropagation networks," Proc. Int. Joint Conf. Neural Networks, pp. 365–375.
Bengio, Y. & Grandvalet, Y. [2004] "No unbiased estimator of the variance of k-fold cross-validation," J. Mach. Learn. Res. 5, 1089–1105.
Bohanec, M. & Rajkovic, V. [1988] "Knowledge acquisition and explanation for multi-attribute decision making," Proc. 8th Int. Workshop on Expert Systems and Their Applications, pp. 59–78.
Bohanec, M. & Rajkovic, V. [1990] "DEX: An expert system shell for decision support," Sistemica 1, 145–157.
Bottou, L. [2004] Stochastic Learning, Lecture Notes in Artificial Intelligence, Vol. 3176 (Springer).
Chen, L. & Aihara, K. [2000] "Strange attractors in chaotic neural networks," IEEE Trans. Circuits Syst. I: Fund. Th. Appl. 47, 1455–1468.
Crespi, B. [1999] "Storage capacity of non-monotonic neurons," Neur. Netw. 12, 1377–1389.
Deng, L. [2012] "The MNIST database of handwritten digit images for machine learning research," IEEE Sign. Process. Mag. 29, 141–142.
Detrano, R., Janosi, A. & Steinbrunn, W. [1989] "International application of a new probability algorithm for the diagnosis of coronary artery disease," Amer. J. Cardiol. 64, 304–310.
Du, K.-L. & Swamy, M. N. S. [2013] Neural Networks and Statistical Learning (Springer, London).
Duane, G. S. [2017] "'Force' learning in recurrent neural networks as data assimilation," Chaos 27, 126804.
Esteva, A., Kuprel, B., Novoa, R. A., Ko, J., Swetter, S. M. & Blau, H. M. [2017] "Dermatologist-level classification of skin cancer with deep neural networks," Nature 542, 115–118.
Farlow, S. J. [1984] Self-Organizing Methods in Modeling: GMDH Type Algorithms (Marcel Dekker, Inc., NY).
Fisher, R. A. [1936] "The use of multiple measurements in taxonomic problems," Ann. Eugen. 7, 179–188.
Friedman, J. H. & Stuetzle, W. [1981] "Projection pursuit regression," J. Amer. Statist. Assoc. 76, 817–823.
Gallo, V. [2007] "Surprising synapses deep in the brain," Nat. Neurosci. 10, 267–269.
Goodfellow, I., Bengio, Y. & Courville, A. [2016] Deep Learning (The MIT Press, Cambridge).
Hagan, M. T., Demuth, H. B., Beale, M. H. & Jesús, O. D. [2014] Neural Network Design, 2nd edition (Hagan Martin, Stillwater).
Hassoun, M. H. [1995] Fundamentals of Artificial Neural Networks (The MIT Press, Cambridge).
Hastie, T., Tibshirani, R. & Friedman, J. [2009] The Elements of Statistical Learning: Data Mining, Inference, and Prediction, 2nd edition (Springer, NY).
Haykin, S. [2009] Neural Networks and Learning Machines, 3rd edition (Prentice Hall, Upper Saddle River, NJ).
Hegab, A. M., Salem, N. M., Radwan, A. G. & Chua, L. [2015] "Neuron model with simplified memristive ionic channels," Int. J. Bifurcation and Chaos 25, 1530017.
Hinton, G., Osindero, S. & Teh, Y. W. [2006] "A fast learning algorithm for deep belief nets," Neur. Comput. 18, 1527–1554.
Hinton, G., Deng, L., Yu, D., Dahl, G. E., Rahman Mohamed, A., Jaitly, N., Senior, A., Vanhoucke, V., Nguyen, P., Sainath, T. N. & Kingsbury, B. [2012a] "Deep neural networks for acoustic modeling in speech recognition," IEEE Sign. Process. Mag. 29.
Hinton, G., Srivastava, N., Krizhevsky, A., Sutskever, I. & Salakhutdinov, R. R. [2012b] "Improving neural networks by preventing co-adaptation of feature detectors," arXiv: 1207.0580.
Hirose, Y., Yamashita, K. & Hijiya, S. [1991] "Backpropagation algorithm which varies the number of hidden units," Neur. Netw. 4, 61–66.
Hornik, K., Stinchcombe, M. & White, H. [1989] "Multilayer feedforward networks are universal approximators," Neur. Netw. 2, 359–366.
Hramov, A. E., Frolov, N. S. & Maksimenko, V. A. [2018] "Artificial neural network detects human uncertainty," Chaos 28, 033607.
Hsu, C.-W., Chang, C.-C. & Lin, C.-J. [2003] "A practical guide to support vector classification," http://www.csie.ntu.edu.tw/~cjlin/papers/guide/guide.pdf.
Ioffe, S. & Szegedy, C. [2015] "Batch normalization: Accelerating deep network training by reducing internal covariate shift," Proc. 32nd Int. Conf. Machine Learning, pp. 448–456.
Islam, M. M., Sattar, M. A., Amin, M. F., Yao, X. & Murase, K. [2009] "A new adaptive merging and growing algorithm for designing artificial neural networks," IEEE Trans. Syst. Man Cybern. Part B: Cybern. 39, 705–722.
Kohavi, R. [1995] "A study of cross-validation and bootstrap for accuracy estimation and model selection," Proc. 14th Int. Joint Conf. Artificial Intelligence, Vol. 2, pp. 1137–1145.
Krizhevsky, A., Sutskever, I. & Hinton, G. [2012] "ImageNet classification with deep convolutional neural networks," Proc. Adv. Neural Inform. Proc. Syst., pp. 1097–1105.
Kukley, M., Capetillo-Zarate, E. & Dietrich, D. [2007] "Vesicular glutamate release from axons in white matter," Nat. Neurosci. 10, 321–330.
Kussul, E. & Baidyk, T. [2004] "Improved method of handwritten digit recognition tested on MNIST database," Imag. Vis. Comput. 22, 971–981.
Kwok, T.-Y. & Yeung, D.-Y. [1997] "Constructive algorithms for structure learning in feedforward neural networks for regression problems," IEEE Trans. Neur. Netw. 8, 630–645.
Lawrence, S., Giles, C. L., Tsoi, A. C. & Back, A. D. [1997] "Face recognition: A convolutional neural-network approach," IEEE Trans. Neur. Netw. 8, 98–113.
Lebiere, C. & Fahlman, S. E. [1990] "The cascade-correlation learning architecture," Proc. Adv. Neural Inform. Process. Syst., pp. 524–532.
LeCun, Y., Cortes, C. & Burges, C. J. C. [1998] "The MNIST database of handwritten digits," http://yann.lecun.com/exdb/mnist.
LeCun, Y., Bengio, Y. & Hinton, G. [2015] "Deep learning," Nature 521, 436–444.
Li, X., Xiang, S., Zhu, P. & Wu, M. [2015] "Establishing a dynamic self-adaptation learning algorithm of the BP neural network and its applications," Int. J. Bifurcation and Chaos 25, 1540030-1–10.
Lin, W. & Chen, G. [2009] "Large memory capacity in chaotic artificial neural networks: A view of the anti-integrable limit," IEEE Trans. Neur. Netw. 20, 1340–1412.
Lu, W., Rossoni, E. & Feng, J. [2010] "On a Gaussian neuronal field model," NeuroImage 52, 913–933.
Ma, L. & Khorasani, K. [2004] "Facial expression recognition using constructive feedforward neural networks," IEEE Trans. Syst. Man Cybern. Part B: Cybern. 34, 1588–1595.
Ma, L. & Khorasani, K. [2005] "Constructive feedforward neural networks using Hermite polynomial activation functions," IEEE Trans. Neur. Netw. 16, 821–833.
Ma, J. F. & Wu, J. [2009] "Multistability and gluing bifurcation to butterflies in coupled networks with non-monotonic feedback," Nonlinearity 22, 1383–1412.
Magoulas, G. D. & Vrahatis, M. N. [2006] "Adaptive algorithms for neural network supervised learning: A deterministic optimization approach," Int. J. Bifurcation and Chaos 16, 1929–1950.
Morita, M. [1993] "Associative memory with nonmonotone dynamics," Neur. Netw. 6, 115–126.
Nara, S. [2003] "Can potentially useful dynamics to solve complex problems emerge from constrained chaos and/or chaotic itinerancy?" Chaos 13, 1110–1121.
Nara, S. [2008] "Novel tracking function of moving target using chaotic dynamics in a recurrent neural network model," Cogn. Neurodyn. 2, 39–48.
Parekh, R., Yang, J. & Honavar, V. [2010] "Constructive neural-network learning algorithms for pattern classification," IEEE Trans. Neur. Netw. 11, 436–451.
Platt, J. [1991] "A resource-allocating network for function interpolation," Neur. Comput. 3, 213–225.
Ren, G., Zhou, P., Ma, J., Cai, N., Alsaedi, A. & Ahmad, B. [2017] "Dynamical response of electrical activities in digital neuron circuit driven by autapse," Int. J. Bifurcation and Chaos 27, 1750187-1–9.
Rivals, I. & Personnaz, L. [2003] "Neural-network construction and selection in nonlinear modeling," IEEE Trans. Neur. Netw. 14, 804–819.
Sharma, S. K. & Chandra, P. [2010] "Constructive neural networks: A review," Int. J. Engin. Sci. Technol. 2, 7847–7855.
Silver, D., Huang, A., Maddison, C. J., Guez, A., Sifre, L. & van den Driessche, G. [2016] "Mastering the game of Go with deep neural networks and tree search," Nature 529, 484–489.
Sutskever, I., Vinyals, O. & Le, Q. V. [2014] "Sequence to sequence learning with neural networks," Adv. Neur. Inform. Proc. Syst. 27, 3104–3112.
Sutton, R. S. & Barto, A. G. [1998] Reinforcement Learning: An Introduction (The MIT Press, Cambridge).
Trappenberg, T. [2010] Fundamentals of Computational Neuroscience (Oxford University Press, Oxford).
Vromen, T. G. M., Steur, E. & Nijmeijer, H. [2016] "Training a network of electronic neurons for control of a mobile robot," Int. J. Bifurcation and Chaos 26, 1650196-1–16.
Wu, X., Rozycki, P. & Wilamowski, B. M. [2015] "A hybrid constructive algorithm for single-layer feedforward networks learning," IEEE Trans. Neur. Netw. Learn. Syst. 26, 1659–1668.
Yanai, H. F. & Amari, S. I. [1996] "Auto-associative memory with two-stage dynamics of nonmonotonic neurons," IEEE Trans. Neur. Netw. 7, 803–815.
Yang, S.-H. & Chen, Y.-P. [2012] "An evolutionary constructive and pruning algorithm for artificial neural networks and its prediction applications," Neurocomputing 86, 140–149.
Yoshizawa, S., Morita, M. & Amari, S. [1996] "Autoassociative memory with two-stage dynamics of nonmonotonic neurons," IEEE Trans. Neur. Netw. 7, 803–815.
Ziskin, J. L., Nishiyama, A., Rubio, M., Fukaya, M. & Bergles, D. E. [2007] "Vesicular release of glutamate from unmyelinated axons in white matter," Nat. Neurosci. 10, 321–330.
Zupan, B., Bohanec, M. & Bratko, I. [1997] "Machine learning by function decomposition," Proc. ICML 1997, pp. 421–429.