IEEE TRANSACTIONS ON NEURAL NETWORKS, VOL. 19, NO. 8, AUGUST 2008
The Q-Norm Complexity Measure and the Minimum Gradient Method: A Novel Approach to the Machine Learning Structural Risk Minimization Problem

D. A. G. Vieira, Member, IEEE, Ricardo H. C. Takahashi, Vasile Palade, Senior Member, IEEE, J. A. Vasconcelos, and W. M. Caminhas
Abstract—This paper presents a novel approach to structural risk minimization (SRM) applied to a general setting of the machine learning problem. The formulation is based on the fundamental concept that supervised learning is a bi-objective optimization problem in which two conflicting objectives should be minimized: the empirical training error and the machine complexity. In this paper, a general Q-norm method to compute the machine complexity is presented and, as a practical particular case, the minimum gradient method (MGM) is derived relying on the definition of the fat-shattering dimension. A practical mechanism for parallel layer perceptron (PLP) network training, involving only quasi-convex functions, is generated using the aforementioned definitions. Experimental results on 15 different benchmarks are presented, which show the potential of the proposed ideas.

Index Terms—Complexity measure, multiobjective training algorithms, neural networks, parallel layer perceptron (PLP), regularization methods, structural risk minimization (SRM).
I. INTRODUCTION

IN RECENT YEARS, the machine learning community has focused its attention on the generalization problem, using complexity control as its main tool. Especially after the introduction of support vector machines (SVMs) [1]–[4], the theoretical and practical aspects of measuring and controlling complexity have been strongly developed. From a theoretical point of view, the introduction of the Vapnik–Chervonenkis (VC) dimension [1] and, afterwards, of the fat-shattering dimension [5] have been the greatest advances. The SVM performs complexity control through the maximization of the separating hyperplane margin, i.e., the minimization of the VC dimension. Another popular way to use complexity control in machine learning is the weight decay (WD) technique [6], which is a realization of the regularization method [7], [8]. The WD, which is
Manuscript received October 30, 2006; revised June 1, 2007 and January 1, 2008; accepted January 16, 2008. First published July 15, 2008; last published August 6, 2008 (projected). This work was supported by the CNPq under Grants 350902/1997-6 and 140009/2004-3, by the CAPES-COFECUB, Brazil, Project Cooperation 318/00-II, and by the CAPES under Grant 3421/04-0.
D. A. G. Vieira, J. A. Vasconcelos, and W. M. Caminhas are with the Department of Electrical Engineering, Federal University of Minas Gerais, Belo Horizonte, MG 31270-010, Brazil (e-mail: [email protected]; [email protected]; [email protected]).
R. H. C. Takahashi is with the Department of Mathematics, Federal University of Minas Gerais, Belo Horizonte, MG 31270-010, Brazil (e-mail: [email protected]).
V. Palade is with the Computing Laboratory, Oxford University, Oxford OX1 3QD, U.K. (e-mail: [email protected]).
Digital Object Identifier 10.1109/TNN.2008.2000442
mainly applied to multilayer perceptrons (MLPs) [9], is based on the idea of controlling the norm of the network weights to improve generalization. The neural network, in this case, is trained using a weighted sum of the training error and the norm of the weight vector. However, as will be shown in this paper, this algorithm is not capable of generating the whole set of best possible solutions of the bi-objective problem. This is due to the fact that convexity cannot be guaranteed in the WD formulation, a main condition in the regularization formulation [7]. To overcome this drawback, some novel algorithms based on multiobjective optimization concepts have been developed for training MLPs [10]. The same concepts have been used later to generate a sliding-mode control training method [11] and have been successfully applied to the parallel layer perceptron (PLP) topology [12], [13]. In the case of radial basis function (RBF) networks [9], complexity control has been employed too, using another definition of complexity. Girosi et al. [14] performed the minimization of the network output high frequencies, in order to achieve smoother functions, and they also used a weighted sum formulation of the regularization method.

The common structure that is shared by the aforementioned methods has not been fully recognized up to now. This paper sketches a unifying analysis that suggests a new complexity measure, defined in terms of a mathematical Q-norm, called here the Q-norm complexity measure. This formulation is a natural way to split the influence of the linear and nonlinear parameters in the learning process, leading to a formulation capable of uncoupling the convex and nonconvex parts of the optimization problem associated with the learning procedure. This makes the consistent usage of a weighted sum approach to perform the learning possible, which allows computationally efficient learning algorithms.
A specific Q-norm measure is proposed here, based on the definition of the fat-shattering dimension. This measure of complexity is defined directly in the input space, instead of being defined in a given feature space, as would be the case for SVMs. Such a Q-norm has been derived for use with a new learning algorithm, the minimum gradient method (MGM), which, in turn, has been derived using the concept of the separation margin (resembling the SVMs). There would be alternative routes for deriving the MGM: for instance, an argument based on frequency shaping of the network filtering properties could be used instead, among others. It will be shown in this paper that there are some interesting relationships between the choice of the kernel and the
1045-9227/$25.00 © 2008 IEEE
matrix $Q$ in the complexity measure proposed here. As highlighted in [15], there are many reasons mandating a deeper understanding of the connection between the complexity (and the decision rule itself) in the input and feature spaces. The formulation presented here can be applied to many types of machine learning algorithms and topologies. In this paper, the focus is on the training of the PLP since, for this topology, the problem can be solved using quadratic functions only, by means of the Q-norm decomposition. This leads to a least-squares-like closed-form solution. For the PLP network, the gradient is written as a Q-norm, which is to be minimized.

This paper is structured as follows. First, a general overview of the structural risk minimization (SRM) problem and its interpretation as a bi-objective optimization problem is presented. Afterwards, a simple single-variable problem is presented in which the standard weighted sum approach fails. Next, the main characteristics of some machine learning techniques that apply this concept are discussed. The Q-norm formulation of the complexity measure is presented in Section IV, followed by the proposal of its particular case, the MGM. A practical mechanism for the PLP is derived based on this approach. Several experimental results are presented in Section VIII, which are compared with the results obtained by some relevant algorithms in the literature.

II. STRUCTURAL RISK MINIMIZATION PROBLEM

It has been theoretically observed that the upper bound of the machine expected risk (real risk) is a function of two factors: the empirical risk (training error) and the VC dimension (capacity measure) [1], [4], which can be generalized as the fat-shattering dimension. With probability $1-\eta$ over random samples, the expected risk $R$ is bounded by a function $\Phi$ that is an inverse function of the number of samples $N$ and a direct function of the empirical risk $R_{\mathrm{emp}}$ and of the capacity (complexity) $h$, as shown in (1). Thus, by decreasing $h$, the expected risk upper bound is also decreased. This can be done1 by increasing the number of samples $N$ and/or decreasing $h$:

$$R \leq R_{\mathrm{emp}} + \Phi(h, N, \eta). \qquad (1)$$

For a general class of real functions, the margin can be defined as follows. Consider the class $F$ of real functions in the input space $X$ for a classification problem with threshold at 0. The margin of an example $(x, y)$, in which $x$ and $y$ are the input vector and the desired output, respectively, in relation to a function $f \in F$, can be defined [3] as

$$\gamma((x, y), f) = y f(x). \qquad (2)$$

This is a margin definition in the function space. It can be observed that $\gamma((x, y), f) > 0$ corresponds to a correct classification of the given example. The margin distribution of $f$ in relation to the training set $S$, in which $\ell$ is the size of $S$, can be defined as

$$M_S(f) = \left\{ \gamma((x_t, y_t), f) \right\}_{t=1}^{\ell}. \qquad (3)$$

1This is the worst-case theory that aims to optimize the bounds, not the expected risk itself.
The smallest margin in a given set $S$, related to the margin distribution, is an important measure

$$\gamma_S(f) = \min_{1 \le t \le \ell} \gamma((x_t, y_t), f). \qquad (4)$$

In some machine learning schemes, like SVMs, the maximum of the smallest margin, defined as

$$\gamma_S = \max_{f \in F} \gamma_S(f) \qquad (5)$$

is used as a capacity (complexity) measure.2 This maximum is computed for a fixed structure, over the set of all valid parameter values for that structure. Using the definition in (2), a generalization of the VC dimension can be constructed: the fat-shattering dimension [16]. This can be used to prove that, in order to decrease the expected risk upper bound, it is required to increase the margin $\gamma_S$ [5], [17]. The VC and fat-shattering dimensions are capacity measures based on the margin.

Following these considerations, Vapnik [18] proposed the SRM principle, which considers two factors: the minimization of the empirical risk and of the machine capacity (complexity). The VC dimension has been used as the capacity measure. Consider the function $f: X \times \Theta \to \mathbb{R}$, in which $X$ and $\Theta$ are arbitrary spaces and $\mathbb{R}$ denotes the field of real numbers. Taking its second argument $\theta$ as a parameter constrained to a set $\Theta_k$, a set of functions $F_k$ becomes defined. This set can be structured as a sequence of nested subsets defined for $k = 1, 2, \ldots$, such that

$$F_1 \subset F_2 \subset \cdots \subset F_k \subset \cdots \qquad (6)$$

The sequence (6) should fulfill the following conditions.
• The VC dimension $h_k$ of each set $F_k$ is finite, and

$$h_1 \le h_2 \le \cdots \le h_k \le \cdots \qquad (7)$$

• For any positive integer $k$, there is a finite positive scalar $B_k$ such that $|f(x, \theta)| \le B_k$, $\forall \theta \in \Theta_k$ and $\forall x \in X$.

The principle of SRM is oriented to find the values of $k$ and $\theta \in \Theta_k$ such that the function $f(\cdot, \theta)$ minimizes the empirical risk, while the set $F_k$ minimizes the structural risk.

A. Multiobjective Interpretation of SRM

The method of SRM can be interpreted as a multiobjective optimization problem (in fact, a bi-objective problem), which searches for the tradeoff solutions between two objectives that are conflicting. An obvious generalization is to consider that the set $F_c$ is parameterized by a continuous parameter $c$, instead of the integer index $k$. Given a training set $S$, the SRM problem for this set can be written as
$$(\mathrm{SRM}): \quad \min_{\theta}\ \left\{ R_{\mathrm{emp}}(\theta),\ \Omega(\theta) \right\} \qquad (8)$$

in which $R_{\mathrm{emp}}(\theta)$ represents some empirical risk function and $\Omega(\theta)$ is the complexity of the learning machine, for instance, the VC dimension.

2The complexity increases as $\gamma_S$ decreases.
Usually, it is not possible to minimize $R_{\mathrm{emp}}$ and $\Omega$ simultaneously, because the optimum of one function is hardly ever the optimum of the other. Thus, when a multiobjective formulation is considered, there is not a single optimum but a set of optima. In order to state the solutions of the SRM, the following definitions are required.

1) Dominance: A pair $(R_{\mathrm{emp}}^a, \Omega^a)$ dominates another pair $(R_{\mathrm{emp}}^b, \Omega^b)$, which is denoted by $(R_{\mathrm{emp}}^a, \Omega^a) \prec (R_{\mathrm{emp}}^b, \Omega^b)$, if $R_{\mathrm{emp}}^a \le R_{\mathrm{emp}}^b$ and $\Omega^a \le \Omega^b$, with the strict inequality valid for at least one of the functions.

2) Pareto optimality: A pair $(R_{\mathrm{emp}}, \Omega)$ is called Pareto optimal (PO) if there is no other feasible pair that dominates it.

By using these definitions, it is possible to generate the set of solutions called the PO front, which have the best tradeoff between the error and the machine complexity. All such solutions are candidate solutions for the SRM problem. Examining (6) and (7) from the viewpoint of the Pareto optimality of (12), it can be seen that, in the nested sequence, the minimal empirical error in the set $F_k$ is ordered as $R_{\mathrm{emp}}^1 \ge R_{\mathrm{emp}}^2 \ge \cdots$, while the complexity is ordered as $\Omega^1 \le \Omega^2 \le \cdots$. This means that the solutions of

$$\min_{\theta}\ R_{\mathrm{emp}}(\theta) \quad \text{subject to: } \Omega(\theta) \le c_k \qquad (9)$$

are PO ones, each one associated with the corresponding sequence set $F_k$. These are the solutions of the SRM problem. Any other function that is not a solution of any minimization problem of this form must be dominated and cannot be a solution of the SRM problem. This reinterpretation can be very useful for deriving machine learning schemes based on the SRM concept. This idea is further exploited in this paper. Finally, a particular form of the structural risk function associated with some particular sets $F_c$ can be defined by

$$F_c = \left\{ f(\cdot, \theta) : \|\theta\| \le c \right\}. \qquad (10)$$
Fig. 1. Example illustrating the concept of PO. The x-axis represents the complexity $\Omega(w)$ and the y-axis the empirical risk $R_{\mathrm{emp}}(w)$. Among the five learning machines shown, several pairs exhibit no relation of dominance between them, while the dominated machines are worse in both attributes and are therefore not of interest. The nondominated machines form the PO front, which contains the machines of interest in this paper.
Given $c_1 \le c_2$, this choice of $F_c$ and $\Omega$ preserves the following necessary relations:
• $F_{c_1} \subseteq F_{c_2}$;
• $h_{c_1} \le h_{c_2}$;
• the functions in each set $F_c$ are bounded.

The structural risk is, in this case,

$$\Omega(\theta) = \|\theta\|. \qquad (11)$$

As the minimization of the structural risk is equivalent to the minimization of the norm of $\theta$, the SRM principle becomes, in this case, stated in terms of $\theta$ only:

$$(\mathrm{SRM}): \quad \min_{\theta}\ \left\{ R_{\mathrm{emp}}(\theta),\ \|\theta\| \right\}. \qquad (12)$$

A graphical example of these multiobjective concepts using five learning machines is shown in Fig. 1.

B. The Weighted Sum Approach to Nonconvex Problems

The next step here is to show that the well-known weighted sum approach for dealing with multiobjective problems is not directly suitable for finding the solutions of the SRM problem. The weighted sum approach is based on the definition of the functional

$$\phi(\theta) = \lambda f_1(\theta) + (1 - \lambda) f_2(\theta) \qquad (13)$$

where $\lambda \in [0, 1]$ controls the importance of the objectives. The issue is that, by varying $\lambda$, it would not be possible to generate the PO solutions (the SRM solutions) via the minimization of $\phi$. This, in fact, is a standard result from multiobjective optimization theory that can be explained as follows. Consider, for instance, two nonconvex unimodal one-variable functions $f_1$ and $f_2$, as given in (14) and (15). Given, for instance, $\lambda = 0.7$ and $\lambda = 0.4$, the corresponding weighted sum functions $\phi_1$ and $\phi_2$ can be written as in (16) and (17). The initial functions $f_1$ and $f_2$ and the two possible weighted functions $\phi_1$ and $\phi_2$ are shown in Fig. 2. The first noticeable thing is that the weighted functions have become multimodal, although the original functions were unimodal. Second, the two local optima of the weighted functions $\phi_1$ and $\phi_2$ correspond to the individual minima of $f_1$ and $f_2$ (this occurs no matter what value
Fig. 2. The original functions are presented in continuous line (—). Two possible weighted solutions for this problem, with $\lambda = 0.7$ and $\lambda = 0.4$, are shown in dashed and dash-dotted lines, respectively.

where $\hat{y}_t$ is the MLP output relative to the $t$th sample and $w$ and $v$ are the weights corresponding to the output and hidden layers, respectively. Several algorithms that consider only the minimization of the empirical risk $R_{\mathrm{emp}}$ have been proposed for this neural network topology: backpropagation [19], quick propagation [22], and the Levenberg–Marquardt algorithm [23]. Nevertheless, in [24] and [17], it has been observed that the fat-shattering dimension upper bound for an MLP is limited by its weight vector norm; thus, the fat-shattering dimension can be bounded by limiting the network weights. This is a theoretical reason for using algorithms that consider both the minimization of $R_{\mathrm{emp}}$ and of the weight vector norm, as is the case with the WD method [6] and with a family of algorithms based on multiobjective optimization techniques [10], [11] that use the norm of the weight vector in the complexity control. The WD is based on penalizing networks with high values of the weight vector norm. This can be written as in

$$\phi_{\mathrm{WD}}(w, v) = \lambda \left( \|w\|^2 + \|v\|^2 \right) + (1 - \lambda) R_{\mathrm{emp}} \qquad (19)$$
of $\lambda$ is chosen). Therefore, the minimization of $\phi_1$ and $\phi_2$ can generate no other solution than the individual minima of $f_1$ and $f_2$. These facts are due to the nonconvexity of $f_1$ and $f_2$. In the case of convex functions, the variation of $\lambda$ would generate the whole Pareto set of the problem of simultaneous minimization of such functions. The conclusion that is relevant here is as follows: if $f_1$ and $f_2$ are not convex functions (which is the case in most machine learning problems), the weighted sum approach should not be employed for trying to find the solutions of the SRM problem.

The next sections will present some machine learning techniques that apply the SRM principle. This principle will be used to define a complexity measure in terms of a Q-norm. The Q-norm is a way to uncouple the convex and nonconvex parts of the learning problem, raising the possibility of using the weighted sum approach in the convex part, while preserving the property of solving the SRM problem. This will be used for deriving a new learning procedure called the MGM.

III. MACHINE LEARNING TECHNIQUES THAT APPLY THE SRM PRINCIPLE

In this section, some classical techniques that improve generalization by using complexity (capacity) control are reviewed.

A. Multilayer Perceptron

The MLP became a very popular machine learning method after the introduction of the backpropagation algorithm [19] and the proof of its universal approximation capabilities [20], [21]. The output of this topology, considering $m$ hidden neurons and $n$ inputs, can be written as

$$\hat{y}_t = \sum_{j=1}^{m} w_j\, \varphi\!\left( \sum_{i=1}^{n} v_{ji} x_{t,i} \right) \qquad (18)$$
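The weighted-sum pathology described in Section II-B can be reproduced numerically. The sketch below is illustrative only: the paper's exact functions (14)–(15) are not legible in this copy, so two stand-in nonconvex unimodal functions with minima at $x = \pm 2$ are assumed. Minimizing the scalarization (13) over a grid, every choice of $\lambda$ collapses onto one of the two individual minima; no intermediate tradeoff solution is ever produced.

```python
import numpy as np

# Two nonconvex, unimodal one-variable functions (stand-ins for (14)-(15);
# the paper's exact expressions are not reproduced here).
f1 = lambda x: 1.0 - np.exp(-(x - 2.0) ** 2)   # minimum at x = +2
f2 = lambda x: 1.0 - np.exp(-(x + 2.0) ** 2)   # minimum at x = -2

x = np.linspace(-5, 5, 2001)
minimizers = []
for lam in np.linspace(0.05, 0.95, 19):
    phi = lam * f1(x) + (1.0 - lam) * f2(x)     # weighted sum as in (13)
    minimizers.append(x[np.argmin(phi)])

# Every scalarization lands on one of the two individual minima:
print(sorted(set(round(abs(m), 1) for m in minimizers)))  # -> [2.0]
```

For convex objectives, sweeping $\lambda$ would trace the whole Pareto set; here the multimodality of the weighted sum hides every compromise solution.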
where $\lambda$ represents the tradeoff between the regularization term (note that it is composed of the weights $w$ and $v$) and the empirical risk $R_{\mathrm{emp}}$. The multiobjective-optimization-based algorithms are defined in [10] and [13] as

$$\min_{w, v}\ R_{\mathrm{emp}} \quad \text{subject to: } \|(w, v)\| \le c. \qquad (20)$$

In this formulation, the structural risk is chosen as the norm of the weight vector, and the value $c$ defines the set $F_c$. The formulation of (20) is a realization of the problem stated in (9), with the set $F_c$ chosen as in (10). In fact, in this paper's view, the WD method is also a multiobjective approach in which the solutions are generated using a weighted sum of the objectives, with the limitations presented previously in this paper. As the complexity measure is already convex, since the Euclidean norm of the weight vector is used, these approaches would be equivalent if the empirical risk were a convex function for the given data set [25]. However, this is unlikely to occur, which means that the WD method fails in finding the whole PO set of solutions.3
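For intuition about the weighted-sum WD objective of (19), consider the special case of a linear model, where it reduces to ridge regression with a closed-form solution. This is an illustrative sketch, not the paper's MLP training procedure; the data and regularization values are assumed.

```python
import numpy as np

def weight_decay_fit(X, y, lam):
    """Minimize ||Xw - y||^2 + lam * ||w||^2, the linear-model analogue of
    the weighted-sum WD objective (19); the closed form is
    w = (X^T X + lam I)^{-1} X^T y."""
    n = X.shape[1]
    return np.linalg.solve(X.T @ X + lam * np.eye(n), X.T @ y)

rng = np.random.default_rng(0)
X = rng.normal(size=(50, 3))
y = X @ np.array([1.0, -2.0, 0.5]) + 0.01 * rng.normal(size=50)

w_small = weight_decay_fit(X, y, lam=1e-6)
w_big = weight_decay_fit(X, y, lam=100.0)
# A larger lam trades empirical risk for a smaller weight norm:
print(np.linalg.norm(w_small), np.linalg.norm(w_big))
```

Because both terms are convex here, sweeping `lam` does trace the Pareto front, in line with the equivalence noted for convex empirical risks [25].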
B. Radial Basis Function Networks

RBF networks are composed of three layers: the input layer, where the input variables are connected; one hidden layer of high dimensionality; and the output layer, which contains one linear neuron [26]. RBF networks are also universal function approximators [27]. The RBF network output is computed as

$$\hat{y}(x) = \sum_{i=1}^{m} w_i\, \varphi\!\left( \|x - c_i\| \right) \qquad (21)$$

3This does not imply that the algorithm will fail, but that it can fail.
where $\varphi$ is a set of functions known as radial basis functions (Gaussian, multiquadrics, etc.) and the $c_i$'s are the function centers. In [14], the authors proposed to control the complexity of an RBF network by controlling its output smoothness. The smoothness has been defined in the frequency domain, in such a way that a smoother function should have less energy in the high-frequency components. The high-frequency energy can be measured by passing the function through a high-pass filter and computing the $L_2$ norm of the result. This can be expressed by

$$\Omega[f] = \int \frac{|F(\omega)|^2}{\tilde{G}(\omega)}\, d\omega \qquad (22)$$

where $F(\omega)$ represents the Fourier transform of $f$ and $1/\tilde{G}(\omega)$ is a high-pass filter, i.e., $\tilde{G}$ is a positive function that tends to zero as $\|\omega\| \to \infty$. In order to solve this problem, a weighted function, as in the regularization problem, has been used [7].

C. Support Vector Machines

Consider the following set of functions that can take only two values:

$$f(x, w, b) = \theta\!\left( \langle w, x \rangle + b \right) \in \{-1, +1\} \qquad (23)$$

where $\theta$ is a unit step function. For this class of functions, the VC dimension is $h = n + 1$. More exactly, only $n + 1$ vectors in general position can be separated. Consider the following hyperplane, called the classification hyperplane with margin, instead of the one given by (23):

$$\left| \langle w, x \rangle + b \right| \ge \gamma. \qquad (24)$$

This classifies the input vectors $x$ as

$$f(x) = \begin{cases} +1, & \text{if } \langle w, x \rangle + b \ge \gamma \\ -1, & \text{if } \langle w, x \rangle + b \le -\gamma \end{cases} \qquad (25)$$

For this case, considering a set of vectors in the sphere with radius $R$, the set of hyperplanes with margin $\gamma$ has the VC dimension $h$ limited by the expression in the right-hand part of

$$h \le \min\!\left( \left\lceil \frac{R^2}{\gamma^2} \right\rceil, n \right) + 1. \qquad (26)$$

In this case, the VC dimension of a hyperplane can be decreased. This fact is used to build the SVMs, where the capacity of learning (complexity) is controlled by using the margin [2], [3]. In fact, the margin of a linear classifier is an inverse function of the weight vector norm, therefore the latter should be minimized. The complexity term for SVMs can be written as

$$\Omega(w) = \|w\|^2. \qquad (27)$$

More exactly, the SVMs maximize the margin, i.e., minimize the weight vector norm, in a feature space, which makes a nonlinear mapping of the input space to another space where the linear classification can be performed. This is done by using kernels.

It is worth noting that the weight vector norm in the case of MLPs represents some upper bound to the fat-shattering dimension. The proofs in [17] show that it is enough that the norm of the weights of the output layer be limited for the fat-shattering dimension to become bounded. In fact, the minimization of the norm of the output weight vector is equivalent to constructing an optimal separation hyperplane in the space defined by the hidden layer, similar to the SVM procedure, given some constraints on the weights of the activation function.

IV. Q-NORM COMPLEXITY MEASURE

In order to control the complexity of the learning machine, the first question to be addressed is what the machine complexity is and how to measure it. In the previous sections, some techniques that use the ideas of complexity control have been presented. From a theoretical point of view, the main tool used today for this purpose is the VC dimension and its generalizations. In practical terms, any useful complexity measure should be defined in terms of the problem input vectors and the machine parameters. Moreover, quadratic functions are more convenient to use, since they are simpler to optimize and often lead to closed-form expressions. For instance, MLPs use the Euclidean norm of the weight vector, while SVMs use the norm of the hyperplane normal vector. The hyperplane normal vector is strongly connected with the weight vector; it is, in fact, a weight vector of a linear model.

For any norm $\|\cdot\|$ and any bijective linear transformation $P$, a new norm of $w$ can be defined to be equal to $\|Pw\|$. For instance, in 2-D, with a rotation by 45° and a suitable scaling, this changes the 1-norm into an $\infty$-norm. Consider the Euclidean norm of the transformed vector

$$\|Pw\|_2^2 = w^T P^T P w = w^T Q w \qquad (28)$$

with $w$ a vector of finite dimension and $Q = P^T P$ a symmetric positive-definite4 matrix, i.e., $Q > 0$. The Q-norm of $w$ is given by $\|w\|_Q = \sqrt{w^T Q w}$. The complexity measure problem can be written as a Q-norm. Such a complexity measure can be equivalently defined as

$$\Omega(w) = w^T Q w. \qquad (29)$$

All the aforementioned techniques can be rewritten in the Q-norm form. For the MLPs and SVMs, using the weight vector norm as a complexity measure, the matrix $Q$ will be the identity matrix. By changing $Q$, other methods can be generated. This formulation can be adapted to deliver a standard definition of complexity for many different learning machines, by spotting their differences in the matrix $Q$. Inspired by kernel methods, such as RBFs and SVMs, the complexity measure can be defined with an explicit dependence on the linear parameters, and with the dependence on the nonlinear parameters (the kernel parameters) implicitly stated, with the matrix $Q$ taken as a function of these nonlinear parameters. Consider a learning machine with linear parameter vector $w$ and nonlinear parameter vector $\theta$. The complexity measure becomes

$$\Omega(w, \theta) = w^T Q(\theta)\, w. \qquad (30)$$

4$Q > 0$ is more precisely stated as $w^T Q w > 0$, $\forall w \neq 0$.
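The Q-norm of (28)–(29) can be checked numerically: any symmetric positive-definite $Q$ factors as $Q = P^T P$ (e.g., via a Cholesky factorization), so $w^T Q w$ is just the squared Euclidean norm of the transformed weight vector. The matrix and vector below are illustrative assumptions.

```python
import numpy as np

def q_norm_complexity(w, Q):
    """Q-norm complexity measure (29): Omega(w) = w^T Q w, for a symmetric
    positive-definite Q."""
    return float(w @ Q @ w)

Q = np.array([[2.0, 0.5], [0.5, 1.0]])   # hypothetical SPD matrix
w = np.array([1.0, -3.0])                # hypothetical weight vector
L = np.linalg.cholesky(Q)                # Q = L L^T
P = L.T                                  # so Q = P^T P, as in (28)
assert np.isclose(q_norm_complexity(w, Q), np.linalg.norm(P @ w) ** 2)

# With Q = I this reduces to the plain squared weight norm used by WD/SVMs:
print(q_norm_complexity(w, np.eye(2)))  # -> 10.0
```

Different learning machines then differ only in how they choose (or parameterize) $Q$, as in (30).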
In this way, a decomposition between the linear and the nonlinear problem is achieved. Note that this type of decomposition is the main achievement of the kernel-based methods, which aim to control the generalization properties by controlling only the linear part of the learning machines (the nonlinearities are left to the kernel functions). In the following sections, a new learning machine, called the MGM, is defined. Its complexity measure is afterwards written as a Q-norm to decompose the influences of the linear and nonlinear parameters in the learning process of the PLP network. The idea of using the gradient as a complexity measure is derived from the concept of margin maximization, and it is afterwards related to other interesting properties, such as the minimization of the high-frequency energy.

V. MINIMUM GRADIENT METHOD

It has been observed that the effective VC dimension depends on the concept of margin between functions, as stated in (2)–(4). These ideas are fundamental to construct the fat-shattering dimension, one of the generalizations of the VC dimension. This margin should be maximized to guarantee the best bound on the expected risk. The necessary condition to achieve this is making the derivatives of (2) equal to zero5

$$\frac{\partial \gamma((x_t, y_t), f)}{\partial x_{t,i}} = y_t \frac{\partial f(x_t)}{\partial x_{t,i}} = 0 \qquad (31)$$

which, considering all input components, amounts to a null input-space gradient

$$\|\nabla_x f(x_t)\| = 0. \qquad (32)$$

For a given set of $N$ samples, this problem can be written as

$$\Omega = \sum_{t=1}^{N} \|\nabla_x f(x_t)\|^2. \qquad (33)$$

Note that (33) has been written in terms of $N$, not the number of training samples $\ell$. This has been done because the gradient can be computed at points that are not in the training set. In the algorithm presented here, $N$ is taken as the training and the test samples together.6 In this way, the gradient norm can be used to compute the complexity for a given machine learning problem. Consequently, a general SRM problem can be formulated as a bi-objective optimization problem as follows:

$$\min_{w}\ \left\{ R_{\mathrm{emp}}(w),\ \Omega(w) \right\}. \qquad (34)$$

The aim of this problem is to achieve the equilibrium between the two factors, which are, in general, conflicting. This equilibrium is responsible for the machine generalization abilities. The gradient has been calculated in relation to the input space $x$, but the control is made in the function parameters $w$.

5In these equations, the index $t$ refers to a sample and the index $i$ to one of the components of the input vector $x$.

6A semisupervised learning could be used to add more points to the complexity evaluation. For the moment, to avoid overfitting, this paper uses both the training and the test set, but extra points could be easily added. This will be studied later.

A. Optimum Separation Hyperplane

Considering the previously described MGM framework, the optimum separation hyperplane can be reconsidered. A general hyperplane can be defined as in (23) and, as aforesaid, the maximum separation is equivalent to minimizing the norm of the coefficient vector $w$. The margin of the hyperplane, where $y$ is the desired output and $f(x) = \langle w, x \rangle + b$ is the learning machine output, can be written as

$$\gamma = y\left( \langle w, x \rangle + b \right) \qquad (35)$$

and its gradient norm is

$$\|\nabla_x \gamma\| = \|w\| \qquad (36)$$

which is a result equivalent to the linear SVM case. In fact, the hyperplane of maximum margin is the one with minimum gradient norm. Since this is a linear problem, the gradient represents a projection of a given point on the separation curve, and it is the shortest distance in the variable space when using the Euclidean norm. The derivation presented in (35) and (36) has been made in the function space and, in this problem, the results are equivalent in both spaces.

B. Nonlinear SVMs

This section discusses how the matrix $Q$, which is an identity matrix for the linear SVMs, behaves when the kernel trick is applied [1]. Consider the primal formulation of the nonlinear SVMs7

$$\min_{w, b}\ \|w\|^2 \qquad (37)$$

$$\text{subject to: } y_t\left( \langle w, \phi(x_t) \rangle + b \right) \ge 1, \quad t = 1, \ldots, \ell \qquad (38)$$

where the separation hyperplane is given by

$$f(x) = \langle w, \phi(x) \rangle + b. \qquad (39)$$

In general, the matrix $Q$ is defined as the identity matrix. Consider $Q$ as an arbitrary symmetric positive-definite matrix. Thus, there exists a matrix $P$ such that $Q = P^T P$, and the objective function can be written as

$$w^T Q w = \|Pw\|^2. \qquad (40)$$

Making $\tilde{w} = Pw$ and $\tilde{x}_t = (P^{-1})^T \phi(x_t)$, the SVM training can be rewritten as

$$\min_{\tilde{w}, b}\ \|\tilde{w}\|^2 \qquad (41)$$

$$\text{subject to: } y_t\left( \langle \tilde{w}, \tilde{x}_t \rangle + b \right) \ge 1, \quad t = 1, \ldots, \ell. \qquad (42)$$

Thus, there is a linear transformation of the (mapped) vectors to vectors $\tilde{x}$ in which the margin maximization problem becomes a general problem given $Q$, a positive-definite matrix. Indeed, the choice of the kernel function implies a complexity control measure, i.e., the matrix $Q$. This result can be used to transform an SVM, based on a given kernel, into another one based on a different kernel. Moreover, it can be used to generate different machines with the same complexity control behavior.

7The index $t$ indicates a sample in the set $S$; thus $x_t$ is a vector and $y_t$ is a scalar.
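The MGM complexity term (33) can be sketched with a numerical input-space gradient. This is an illustrative sketch, not the paper's implementation: for a linear model $f(x) = \langle w, x \rangle + b$ the gradient equals $w$ at every point, so the sum collapses to $N\|w\|^2$, matching the linear SVM result (36).

```python
import numpy as np

def grad_norm_complexity(f, points, eps=1e-5):
    """MGM complexity as in (33): sum of squared input-space gradient norms
    of f over a set of points (central finite differences)."""
    total = 0.0
    for x in points:
        g = np.array([(f(x + eps * e) - f(x - eps * e)) / (2 * eps)
                      for e in np.eye(len(x))])  # rows of I are unit vectors
        total += float(g @ g)
    return total

w, b = np.array([3.0, 4.0]), 0.5          # hypothetical linear model
f = lambda x: float(w @ x + b)
pts = [np.array([0.0, 0.0]), np.array([1.0, 2.0])]
# For a linear model the gradient is w everywhere, so the sum is N * ||w||^2:
print(grad_norm_complexity(f, pts))  # -> ~50.0
```

Unlike the weight norm, this measure applies unchanged to nonlinear models, and the evaluation points need not belong to the training set, as noted for (33).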
Fig. 3. Two linear functions of different slopes as separation functions. The values −1 and +1 represent the classes to be separated.
VI. INTERPRETATIONS OF MGM

This section shows that minimizing the gradient leads to a minimization of the sensitivity of the decision rule; this is a main fact behind the method proposed here. The first (trivial) fact to be stressed is illustrated in Fig. 3. This figure shows that a linear function of smaller slope provides a larger separation margin between two classes. The minimization of the function gradient can be interpreted as a way of maximizing the class separation in an $n$-dimensional space.8

8Note that general functions can be approximated as piecewise linear ones.

A. Decreasing the High-Frequency Energy

Another possible interpretation of the gradient minimization is in terms of the minimization of the function's high-frequency energy. As aforesaid, in [14], the maximization of the function smoothness for RBFs has been proposed, where smoothness is defined through the energy of the high-frequency components of the approximated function. To study this problem and to show some other properties of the MGM, consider the $n$-dimensional Fourier transform

$$F(\omega) = \int f(x)\, e^{-j \langle \omega, x \rangle}\, dx. \qquad (43)$$

The derivative of $f$ with respect to $x_i$ has the following Fourier transform for almost all $\omega$:

$$\frac{\partial f}{\partial x_i} \;\leftrightarrow\; j \omega_i F(\omega). \qquad (44)$$

The term $|F(\omega)|^2$ represents the energy, and it is multiplied by its frequency. Therefore, the bigger $\omega_i$ is, the bigger the amplification of the term $|F(\omega)|^2$ is. Thus, a high-pass filtering is defined, provided that (44) can be integrated [28]. It is also worth noticing that, in the cases where the Fourier transform exists, using higher order derivatives is equivalent to making new definitions of the filtering process. For the $n$th derivative, $(j\omega)^n F(\omega)$ is obtained, hence the filtering is made using $|\omega|^n$. Moreover, the optimum hyperplane can also be viewed as a smooth function whose smoothness is defined as the minimization of the high-frequency energy using the gradient filter. In fact, as shown in [29], there is a bilateral equivalence between the RBFs generated using the high-frequency control and the SVM solutions; bilateral because, given an RBF, it is possible to generate an equivalent SVM and vice versa. Thus, different approaches can achieve similar solutions. In [30], the generalization bounds using the stochastic sensitivity measure were studied. This considers an output perturbation in a given neighborhood, which becomes correlated with the gradient measure as the neighborhood tends to zero.

B. Sine Function

In [4], it has been shown that, by choosing an appropriate coefficient $\alpha$, it is possible to approximate points of any function bounded by $(-1, +1)$ using $\sin(\alpha x)$ at appropriately chosen points. The definition of the VC dimension is based on any set of points, thus the function $\sin(\alpha x)$ has infinite VC dimension even though it has only one free parameter. Given the set of functions

$$\left\{ f(x, \alpha) = \sin(\alpha x) : \alpha \in \mathbb{R} \right\} \qquad (45)$$

the points on the line

$$x_t = 10^{-t}, \quad t = 1, \ldots, \ell \qquad (46)$$

for any given $\ell$ can be shattered. It is sufficient to choose the parameter $\alpha$ as

$$\alpha = \pi \left( 1 + \sum_{t=1}^{\ell} \frac{(1 - y_t)}{2}\, 10^t \right) \qquad (47)$$

given that the data are separated in two classes determined by the sequence

$$y_t \in \{-1, +1\}, \quad t = 1, \ldots, \ell. \qquad (48)$$

Increasing the number of points requires an increment in $\alpha$, i.e., a higher frequency function is used. Thus, the VC dimension of a sine function is closely related to the filtering process described in (22). Consider the following complexity measure:

$$\Omega_n[f] = \left\| \frac{d^n f}{dx^n} \right\|^2. \qquad (49)$$

This complexity measure is the norm of the $n$th derivative. Notice that $\frac{d^n}{dx^n} \sin(\alpha x) = \alpha^n \sin(\alpha x + n\pi/2)$, hence (49) scales as $\alpha^{2n}$. Therefore, limiting the derivative will limit the frequency, and hence the VC dimension, as can be observed in (47) for a sinusoidal model.

C. Decreasing the Weight Vector Norm

In the classical WD approach, the norm of the weight vector is used as a complexity measure. Consider an MLP with sigmoidal activation function $\varphi$. This function has an almost linear behavior when its argument tends to zero

$$\varphi(u) \approx u \quad \text{if } u \to 0. \qquad (50)$$

Thus, the MLP output, where $w$ and $v$ are the output and hidden weights, respectively, can be written as

$$\hat{y} \approx \sum_{j=1}^{m} w_j \sum_{i=1}^{n} v_{ji} x_i = w^T V x. \qquad (51)$$

Hence, the gradient norm, to be minimized, is equivalent to

$$\|\nabla_x \hat{y}\| = \|V^T w\|. \qquad (52)$$

From this result, it is clear that the function will have a "smooth" behavior for small weights, given the almost linear region of operation. This is in accordance with empirical observations. Nevertheless, even when the linearization is considered, the norm of the weight vector is not a very representative measure of complexity. For instance, given $w_j = 0$, the values of $v_{ji}$ are not relevant, since this neuron is not connected to the output, a fact that is considered in (52). In the classical WD formulation, this phenomenon is not taken into account. In fact, the weights are in different domains, thus putting them in the same norm that is to be minimized can give rise to undesirable effects. This result is in line with the observation presented in [24], which uses a bound on the fat-shattering dimension. This complexity measure is more intuitive than the simple weight vector norm, since it considers that there is some relationship between the weights in the hidden and output layers. In [31], the minimization of a bound on the $n$th-order derivative as a complexity measure has been considered, in which an $n$th-order smoothing integral penalizes the derivatives of the network output.
(53)

is bounded in the global form by

(54)

where a weighting function over the frequencies appears in the integrand. This result is in line with the previous results of this paper, which show that the output and hidden weight vectors have different influences on the regularization. Thus, they cannot be viewed as a single vector. Moreover, this result also points out that there is no self-evident choice of the derivative order q. In [32] and [33], the second-order derivative, i.e., the Hessian matrix, has been used in the complexity control. The methods proposed in those papers do not consider the convexity issue and use the weighted sum as a realization of the regularization problem. This leads, as discussed previously in this paper, to nonconsistent results in the complexity control problem. This paper uses the first-order derivative since, as long as there is no theoretical evidence in favor of a higher order approach, it is simpler to implement, delivering a more efficient algorithm. The convexity issue is taken into account in this paper, and a robust closed-form solution to the problem is proposed.

Fig. 4. PLP topology.

VII. MINIMUM GRADIENT METHOD FOR THE PARALLEL LAYER PERCEPTRON

In this section, the MGM is applied to the training of a PLP. It is shown that a Q-norm form can be used to model the network complexity. The main reason for choosing the PLP instead of an MLP is that it is computationally faster [12]. The output of a PLP with n inputs and N parallel neurons is calculated as

(55)

(56)

(57)

Usually, the same functions are defined for all neurons. In the above, the activation functions are general functions (hyperbolic tangent, linear, Gaussian, etc.); the weight matrices connect the inputs to the two parallel layers; each sample's inputs feed both layers; a bias is included in each perceptron; and the network produces the components of the output vector. The PLP architecture is presented in Fig. 4. Further details concerning the PLP topology, including its computational cost compared to other standard topologies, are presented in [12] and [34].
One particular case of this topology can be generated by considering the output and one of the layer functions as identity functions. In this case, the network output is calculated as

(58)

The training aspects when just the empirical risk is considered are presented in [12]. As in the case presented in (58), in which the output is a linear function of the linear parameters, the PLP output can also be written in a matrix form. In this case, consider the vector obtained as a transformation of the linear weight matrix into a vector with the same components, where n is the input space dimension. More exactly, this vector of linear parameters is written as

(59)

By calculating all the outputs of the nonlinear perceptrons, a matrix with one row per sample and one column per linear parameter can be constructed

(60)
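Under one plausible reading of (58)-(60) (the symbols below are assumptions, since the extracted equations are lost: each parallel neuron multiplies a nonlinear perceptron output by a linear term in the inputs plus a bias), the matrix of (60) can be assembled so that the network output is linear in the stacked vector of linear weights:

```python
import numpy as np

def tanh_perceptrons(X, V):
    """Nonlinear-layer outputs phi(V x~) for every sample (tanh is assumed)."""
    Xb = np.hstack([X, np.ones((X.shape[0], 1))])      # append bias input
    return np.tanh(Xb @ V.T), Xb

def build_Z(X, V):
    """Row s stacks phi_j(x_s) * [x_s; 1] for every neuron j, so output = Z w."""
    Phi, Xb = tanh_perceptrons(X, V)                   # (S x N), (S x (n+1))
    S, N = Phi.shape
    return (Phi[:, :, None] * Xb[:, None, :]).reshape(S, N * Xb.shape[1])

rng = np.random.default_rng(0)
X = rng.normal(size=(8, 3))        # 8 samples, 3 inputs
V = rng.normal(size=(5, 4))        # 5 parallel neurons, hidden weights (+ bias)
w = rng.normal(size=5 * 4)         # stacked linear weights, one block per neuron

Z = build_Z(X, V)
y = Z @ w                          # the network output is linear in w

# Cross-check against the direct per-neuron sum of products.
Phi, Xb = tanh_perceptrons(X, V)
y_direct = sum(Phi[:, j] * (Xb @ w[4 * j:4 * j + 4]) for j in range(5))
assert np.allclose(y, y_direct)
```

Once such a matrix exists, the empirical risk becomes an ordinary quadratic least squares problem in the stacked weight vector, which is what makes the closed-form training of the following paragraphs possible.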
Therefore, the output of the PLP network can be written as

(61)

Thus, the empirical risk can be written as

(62)

It is worth noting that, in this case, the error is a quadratic function of the control variables, i.e., of the vector of linear parameters. By calculating the derivative of (58) with respect to the inputs, one obtains [34]

(63)

Hence, the gradient of the network output can be computed as

(64)

The derivatives in relation to the input vector can be written in vector form as follows:

(65)

To exemplify the construction of the derivative matrices, consider the cases shown in (66) and (67), in which the derivatives in relation to the first and the second inputs are computed. In these matrices, the columns related to the weights of the differentiated input are composed of two terms, as can be noticed in the first column of (66) and in the second column of (67); in the other columns, just one term appears. Considering this definition, the complexity function can be defined as

(68)

Clearly, the complexity matrix is symmetric positive-definite, noticing that a sum of symmetric positive-definite matrices is also a symmetric positive-definite matrix. In this case, the functions stated in (62) and (68) are convex. In (62) and (68), the two objectives stated in (34), the minimization of the empirical error and of the complexity, are written as quadratic functions of the linear parameters. Therefore, the solution of the multiobjective optimization problem can be written as a weighted combination of the two objectives without losing any potential solution [13]. This can be written as

(69)

where the optimum can be calculated by making the derivative of (69) equal to zero. The derivative of (69) in relation to the linear parameters can be calculated as
(66)

(67)

(70)
In order to find the optimal linear parameters, the previous relation should be made equal to zero, as given in the following:

(71)

which holds if the system matrix is nonsingular. This training algorithm has been developed in Matlab [35]. Instead of inverting the matrix using the standard inversion command, the command that solves the system iteratively has been used. This is more robust and faster than inverting the system (see [35]). The algorithm's computational complexity scales with the number of free parameters.

The PO set can be found by varying the weighting parameter between zero and one. In this paper, this parameter has been found using the golden section algorithm on the validation error criterion. The validation error for the given formulation is a convex function of the linear parameters. The matrix of perceptron outputs is a function of the nonlinear neurons. The complexity is written in a form in which the influences of the linear and nonlinear parameters are decomposed, i.e., the nonlinearities are used only to define the complexity matrix. This paper does not investigate how to define the values of the nonlinear weights, because this is a complex nonlinear problem, equivalent to the definition of the feature space for SVMs. As will be shown in Section VIII, even though no complexity control (except the definition of the number of nonlinear parameters itself) is used in the nonlinear part, good results are obtained. The nonlinear parameters are used to build the complexity matrix; thus, they count in the network complexity.

In conclusion, the novel algorithm for the parallel layer perceptron topology based on the minimum gradient method (PLP-MGM) can be performed in two steps.
1) Train an overdimensioned PLP to minimize the empirical risk.
2) Perform the minimization of the gradient norm in the network linear parameters using the quadratic expression presented in (71).
This procedure is equivalent to generating a network for which the solutions can be found, and then constraining it by using the gradient as a learning capacity (complexity) measure.
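Step 2) is a single linear solve. The sketch below is an illustrative reconstruction under assumed notation (Z holds the outputs that multiply the linear parameters, Q is the symmetric complexity matrix of (68), and lam is the weighting of (69)); it forms the weighted-sum objective lam*||Zw - y||^2 + (1 - lam)*w'Qw and solves the normal equations of (71) with a linear-system solver rather than an explicit inverse.

```python
import numpy as np

def mgm_linear_weights(Z, y, Q, lam):
    """Solve (lam * Z^T Z + (1 - lam) * Q) w = lam * Z^T y  (cf. (71)).

    Z: (samples x parameters) matrix multiplying the linear parameters
    Q: (parameters x parameters) symmetric PSD complexity matrix
    lam: tradeoff weight in [0, 1]; lam = 1 gives plain least squares.
    """
    A = lam * (Z.T @ Z) + (1.0 - lam) * Q
    b = lam * (Z.T @ y)
    # Solving the system is more robust and faster than inverting A explicitly.
    return np.linalg.solve(A, b)

rng = np.random.default_rng(0)
Z = rng.normal(size=(40, 6))
y = rng.normal(size=40)
G = rng.normal(size=(6, 6))
Q = G.T @ G                      # symmetric positive semidefinite by construction

w = mgm_linear_weights(Z, y, Q, lam=0.7)
# The solution zeroes the gradient of the weighted-sum objective.
grad = 0.7 * 2 * Z.T @ (Z @ w - y) + 0.3 * 2 * Q @ w
assert np.allclose(grad, 0.0, atol=1e-8)
```

Because both objectives are quadratic and the combined matrix is positive definite for lam in (0, 1], every Pareto-optimal network is reached by sweeping the single scalar lam, which is what the golden section search over the validation error exploits.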
Even though in the proposed algorithm no control is applied to the nonlinear part of the network, its effect, whatever it is, is compensated by the linear parameter adjustment procedure. The main differences between the mechanism proposed in this paper and the ones in [32] and [33] are as follows: 1) the use of the first-order derivative instead of the second order; 2) the update of only the convex part of the problem; 3) a closed-form algorithm instead of an iterative one; and 4) a validation error that is convex in relation to the linear parameters. Item 1) assures simplicity in the implementation since, as previously discussed, there is no available result suggesting that higher order derivatives, which are much harder to compute and formulate, are more adequate for the general complexity control problem. Item 2) guarantees the consistency of
the weighted sum approach and makes possible the derivation of an efficient closed-form solution to the problem, stated in item 3). The main advantage of having a unimodal validation error is the consistency of the cross-validation procedure, which, in this case, has a well-defined optimum. A relaxation method to guarantee convexity in the validation set for some learning machines was introduced in [36], and it is closely related to the proposed ideas.

A. Comparison of the PLP-MGM With Other Standard Machine Learning Techniques

This section presents the characteristics, advantages, and disadvantages of the PLP-MGM proposed in this paper compared with some well-known machine learning techniques. Consider the PLP using two parallel layers per neuron, as given in the following:

(72)

Given (72), an MLP with one hidden layer or an RBF network can be obtained by making the nonlinear function a projective basis (a hyperbolic tangent, for instance) or an RBF (such as a Gaussian function), respectively. Indeed, defining the nonlinear parameters in the PLP is equivalent to defining them in the RBF or MLP cases. Why, then, did we decide to use a PLP in this paper instead of the previous ones? The PLP used in this paper considers the second parallel layer as a linear polynomial. With this choice, the number of linear parameters grows with the product of the number of parallel neurons and the input dimension, instead of only with the number of neurons, as would be the case in MLPs or RBFs. This gives more flexibility to the complexity control process, since the network becomes less dependent on the nonlinear parameters. The use of a linear layer appears to be a good tradeoff between the complexity control and the computational effort. Indeed, a general polynomial

(73)

(74)

could be used [37]. The problem of using the result of (74) as one parallel neuron is the fact that the number of free parameters grows exponentially with the dimension of the problem. The number of parameters of one parallel neuron with a polynomial of a given order grows combinatorially with the input dimension, leading to the "curse of dimensionality." We are currently investigating the use of higher order polynomials in the polynomial layer. The MLPs and RBFs are particular cases, obtained when a constant function is used in one of the layers [12]. It is clear that the proposed model can also be extended to the parameter estimation of autoregressive signals, as in [38]. For instance, for training one MLP with the MGM, the result presented in (71) can be used with new matrices. In fact, an MLP with one hidden layer can be viewed as a particular case of the topology presented in (58). Thus, a PLP with a zeroth-order polynomial in the linear layer is equivalent to an MLP with one hidden layer.
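The growth of a full polynomial layer can be quantified: a polynomial of total degree at most d in n variables has C(n + d, d) monomials, which explodes with the input dimension. A quick check (illustrative, not from the paper):

```python
from math import comb

def n_monomials(n_inputs, degree):
    """Number of monomials of total degree <= degree in n_inputs variables."""
    return comb(n_inputs + degree, degree)

# A linear layer grows gently with dimension ...
assert n_monomials(20, 1) == 21
# ... while higher-order polynomial layers suffer the curse of dimensionality.
assert n_monomials(20, 3) == 1771
assert n_monomials(20, 6) == 230230
```

This is why the paper stops at a first-order polynomial layer: it buys extra linear parameters for the complexity control without the combinatorial blowup of higher orders.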
Fig. 5. PO front for the problem stated in (34): the tradeoff between the empirical risk and the gradient norm. To decrease the training error, the gradient norm must increase; this is the tradeoff of interest in this paper.
The computational complexity of the PLP-MGM scales with the product of the number of neurons and the input space dimensionality, as stated in (71). This is the main shortcoming of the proposed technique, especially when compared with kernel methods such as SVMs. The SVMs scale with the number of data samples and are insensitive to the input dimension. However, for some problems, it is easier to obtain samples than to extract relevant features, as in many electromagnetic problems [34]. For this class of problems, it is more adequate to use the PLP-MGM than kernel methods. Both methods are very competitive, as shown in the experimental part of this paper. In applications where the classification runtime speed is mandatory, as in the National Institute of Standards and Technology (NIST, Gaithersburg, MD) handwritten benchmark, which has 60 000 samples, using methods that scale with the dimensionality can be interesting [39]. Classification with a classical SVM is substantially slower when compared to neural networks. A simple example is given below to clarify the PLP-MGM presented in this section.

B. PLP-MGM 1-D Example

To exemplify the behavior of the proposed method, a 1-D regression problem, stated in (75), is used; the data are corrupted with zero-mean white noise.
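An experiment of this kind can be mimicked with a small sketch. It is entirely illustrative: Gaussian basis functions stand in for the PLP's nonlinear perceptrons, and the target function below is an assumption, not (75). The linear weights come from the closed-form weighted-sum solve, the complexity matrix is built from the basis-function derivatives at the data points, and the weighting lam is chosen on a validation set; because the sweep includes lam = 1, the selected model never validates worse than the unregularized least squares fit.

```python
import numpy as np

rng = np.random.default_rng(1)

def features(x, centers, width=0.15):
    """Gaussian basis outputs; the derivative w.r.t. x gives the rows of H."""
    d = x[:, None] - centers[None, :]
    phi = np.exp(-(d / width) ** 2)
    dphi = phi * (-2 * d / width**2)
    return phi, dphi

centers = np.linspace(0, 1, 15)                 # 15 "parallel neurons"
x_tr, x_va = rng.uniform(0, 1, 40), rng.uniform(0, 1, 40)
f = lambda x: np.sin(2 * np.pi * x)             # assumed target; noise is added
y_tr = f(x_tr) + rng.normal(0, 0.2, x_tr.size)
y_va = f(x_va) + rng.normal(0, 0.2, x_va.size)

Z, H = features(x_tr, centers)
Q = H.T @ H                                     # gradient-norm complexity matrix
Zv, _ = features(x_va, centers)

def solve(lam):
    A = lam * (Z.T @ Z) + (1 - lam) * Q
    return np.linalg.solve(A, lam * (Z.T @ y_tr))

def val_err(w):
    return np.mean((Zv @ w - y_va) ** 2)

lams = np.linspace(0.05, 1.0, 20)               # the grid includes lam = 1 (pure LS)
errs = [val_err(solve(l)) for l in lams]
best = lams[int(np.argmin(errs))]

assert min(errs) <= val_err(solve(1.0)) + 1e-12
print("chosen lambda:", best)
```

Sweeping a single scalar along the convex PO front, instead of retraining a nonconvex network per candidate, is the practical payoff of keeping the complexity control in the linear parameters.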
Fig. 6. Desired target function, the training and validation sets, and the PLP outputs for the conventional and proposed techniques. The PLP regularized with the MGM technique presents a smoother result that is closer to the desired function.

(75)

A PLP network with 15 parallel neurons has been used in this simulation. The PO front (in terms of the empirical risk and the gradient norm) is shown in Fig. 5. As the problem is defined only by quadratic functions, the PO front is convex. This PO front is parameterized in terms of the weighting parameter; at its extremes, the minima of the empirical risk and of the gradient norm correspond to weights equal to 1 and 0, respectively. In this front, to decrease the training error, the complexity must be increased: this is the tradeoff of interest in this paper. The desired function, the training and validation sets, the PLP output trained with a single-objective algorithm, and the PLP regularized using the MGM technique are shown in Fig. 6. First, a conventional PLP is trained using all the data, aiming at minimizing the training error (PLP output in Fig. 6), and it clearly overfits. Afterwards, using the nonlinear parameters obtained in the conventional training, the linear parameters are defined using (71), where the patterns in the training set are used to define the error term, the ones in the validation set to define the validation error, and both to define the complexity matrix. The output of the PLP regularized with the MGM approach is also shown in Fig. 6; it is smoother and closer to the desired output stated in (75). Close-ups of parts of Fig. 6 are shown in Figs. 7 and 8. One important feature of this method is that the validation error is a convex function of the linear parameters and, for this case, it is unimodal (it has just one minimum) when parameterized in terms of the weighting parameter (shown in Fig. 9). This is a remarkable fact, showing that on the PO front the validation error has a unique minimum, which must correspond to the best network. The results presented in this section aim to provide a better insight into the nature of the learning process proposed in this paper. Results obtained on 15 benchmarks are presented in Section VIII.

VIII. EXPERIMENTAL RESULTS
In this section, experimental results on some classical machine learning benchmarks and algorithms are presented. Sigmoidal logistic functions have been used as the PLP nonlinear activation functions. The data have been used as given in the benchmarks; no extra normalization has been applied. The nonlinear parameters have been defined using standard fivefold cross validation, which has also been used in the definition of the weighting parameter. The model selection has been performed separately in each trial to avoid selection bias.
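The fivefold cross validation used for model selection can be sketched as follows (illustrative; only the fold bookkeeping is shown, the model fitting and scoring are left out):

```python
import random

def k_fold_indices(n_samples, k=5, seed=0):
    """Yield (train_idx, test_idx) pairs for standard k-fold cross validation."""
    idx = list(range(n_samples))
    random.Random(seed).shuffle(idx)
    fold = n_samples // k
    for i in range(k):
        lo = i * fold
        hi = (i + 1) * fold if i < k - 1 else n_samples
        test = idx[lo:hi]
        train = idx[:lo] + idx[hi:]
        yield train, test

folds = list(k_fold_indices(100, k=5))
assert len(folds) == 5
for train, test in folds:
    assert len(train) + len(test) == 100
    assert not set(train) & set(test)          # train and test folds are disjoint
# Every sample appears in exactly one test fold.
all_test = sorted(i for _, t in folds for i in t)
assert all_test == list(range(100))
```

Each candidate configuration (number of neurons, weighting parameter) would be scored by its mean error over the five held-out folds, and the best-scoring configuration retrained on all the data.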
TABLE I IDA REPOSITORY DATA SET SUMMARY
A. IDA Repository Problems

Fig. 7. Close-up of Fig. 6.
Fig. 8. Close-up of Fig. 6.
Fig. 9. Validation error, in terms of the weighting parameter, for the PO networks. Given the formulation of the MGM method, the validation error is always convex in terms of the linear parameters and unimodal in terms of the weighting parameter; thus, its minimum is unique and well defined.
All 13 data sets from the Intelligent Data Analysis (IDA) repository [40] are considered here. The dimensionality of the input space, the number of training and test samples, and the number of realizations for each data set are summarized in Table I. The results obtained by the PLP-MGM are compared with the results obtained by using the following machine learning techniques:
• SVM;
• kernel Fisher discriminant (KFD);
• RBF;
• AdaBoost (AB);
• regularized AdaBoost (ABR), extracted from [41];
• leave-one-out SVM (LOOM) [42];
• leave-one-out KFD (LOO-KFD);
• cross-validation-based KFD (Xval-KFD) [43];
• generalized least absolute shrinkage and selection operator (LASSO) [44];
• Bayesian kernel logistic regression (KLR) [45];
• posterior probability SVM (PPSVM) [46];
• single-objective PLP [12].
These results are shown in Tables II–IV, where "NA" means "not available." The number of parallel neurons (NPN) used by the PLP and the PLP-MGM for each data set is given in the last lines of the tables. The first noticeable result of Tables II–IV is that the PLP-MGM has outperformed the conventional PLP in most of the tested examples, and that the PLP has never outperformed the PLP-MGM. It is clear as well that the PLP-MGM has achieved results similar to those produced by the other approaches used for comparison. A statistical comparison of the results must then be used [47]. Only the methods that presented results for all data sets are considered in this comparison, disregarding the standard deviation. First, all algorithms are ranked for each data set separately, the best performing algorithm getting a rank of 1, the second best a rank of 2, and so on. In case of ties, average ranks are assigned. The average ranks are presented in Table V, which shows the PLP-MGM as the method with the best rank. Average ranks by themselves provide a fair comparison of the algorithms, as pointed out in [47]. However, the performance
TABLE II IDA REPOSITORY RESULTS I

TABLE III IDA REPOSITORY RESULTS II

TABLE IV IDA REPOSITORY RESULTS III

TABLE V AVERAGE RANK CONSIDERING THE IDA REPOSITORY PROBLEMS

of two classifiers is significantly different if the corresponding average ranks differ by at least a critical difference

CD = q_alpha * sqrt( k(k + 1) / (6N) )   (76)

where k is the number of algorithms, N is the number of data sets, and q_alpha is a critical value that depends on the significance level. For further details, see [47]. In the present case, the resulting critical difference means that the PLP-MGM is only statistically superior to AB. In fact, a higher number of data sets would be required to figure out whether there is a real difference among the studied techniques.
The number of neurons in the PLP-MGM has been set using cross validation, but the nonlinear parameters have been randomly initialized in the algorithm. We are currently working on algorithms to automatically optimize the nonlinear parameters, by defining an "optimal" matrix for the given problem and an "optimal" number of neurons. In this way, we expect to find less variance in the PLP-MGM results; this variance has usually been higher than in the other methods. From [41], we have the following:

"Due to the careful model selection performed in the experiments, all kernel-based methods exhibit similarly good performance. Note that we can expect such a result since they use similar implicit regularization concepts by employing the same kernel."

This quotation reinforces the ideas presented in this paper, especially the result in (41). The careful selection of kernels is a very important issue. This is a nonautomated part of the training algorithm. Once the kernels are well chosen, the machine learning algorithm is expected to achieve good results. For the SVM and KFD techniques, the complexity control is performed only on the linear parameters. Although the PLP-MGM does not control the nonlinear parameters, it uses the information associated with them in the complexity control process. More specifically, the classical SVMs control only the linear part of the problem. As the feature space is a nonlinear mapping, any hyperplane can be mapped to a maximum margin in some feature space, as shown in [48]. The achievement of the SVMs and related kernel techniques is that they separate the easily solvable linear problem from the complex nonlinear one. The same feature is achieved using the proposed framework, which makes it possible to write the complexity as a quadratic form in the linear parameters whose matrix depends on the nonlinear ones. This makes the solution of the linear problem possible without trying to adjust the nonlinear parameters, and it makes the learning machine less sensitive to the nonlinear parts. The cross-validation technique was adopted in this paper and in [41] to select the nonlinear parameters. In recent years, some attempts have been made to solve this problem. For example, in [49], a technique to limit the number of nonzero coefficients that appear in the SVM expansion was presented. In [50], a method to modify the kernel function to
TABLE VI COMPARISON OF PLP-MGM, SVM-S, SVM-T, SVM, AND KFA FOR THE HEART DISEASE DIAGNOSIS PROBLEM
TABLE VII CREDIT ANALYSIS PROBLEM RESULTS
improve the performance of the SVMs for classification problems was proposed. The use of cross validation was studied in [51]. In [3], a method to adapt the SVM kernels was proposed. Hybrid kernels were proposed in [52]. The PLP-MGM does not change the nonlinear part but, since this part is taken into account in the complexity factor, good generalization results can be obtained, as observed in the IDA repository problems.

B. Heart Disease Diagnosis

In [53], a variation from the soft-margin SVMs (SVM-S) to the total-margin SVMs (SVM-T) was proposed. One of the problems tested in that paper is heart disease diagnosis using the Cleveland Clinic Foundation (Cleveland, OH) data set. The proposed PLP-MGM is compared with the results presented in [53] and [54]. Unfortunately, the standard deviation was not presented there. The results generated in this work are the mean of 100 simulations. These results are presented in Table VI and they show the generalization power of the proposed technique. "NA" means "not available" and KFA stands for kernel function approximation. The proposed technique (PLP-MGM) achieved the second best result for this problem, as can be seen in Table VI. The feature space used in the techniques employed for comparison has been determined using cross validation. For the PLP, the nonlinear part has been automatically generated during the training process.

C. Credit Analysis Problem

A data set for a credit analysis problem is available in the University of California at Irvine (UCI) Machine Learning Repository. The data set consists of 690 samples with 15 inputs. The training set contains 75% of the total number of samples, and the remaining samples are used for test purposes. A complete description and analysis of this problem can be found in [55]. Table VII shows that the PLP-MGM outperformed the other techniques used for comparison. The hybrid technique using an RBF and a genetic algorithm achieved a similar result but, as described in [55], it tends to be slow due to the genetic algorithm characteristics.

IX. CONCLUSION

In this paper, the Q-norm complexity measure and an instance of its realization within a machine learning algorithm, the MGM, were presented. Other meaningful definitions for the complexity matrix can possibly be generated using either some theoretical motivation or a priori knowledge. The concept of the fat-shattering dimension was shown to be related to the proposed complexity measure. The MGM was applied to develop a training algorithm for the PLP. This structure was chosen because its number of linear parameters is larger than that of an MLP, which favors the application of the proposed least squares training procedure. The same idea can be applied to MLP and RBF networks, by controlling their output weights; to ANFIS [66] and other types of neurofuzzy networks [67], by controlling the consequents of the rules; as well as to several other methods, such as kernel methods, including simple polynomial approximation problems. In the last case, we have already performed some successful tests. Other possible derivations of the ideas presented in this paper are currently under study. The choice of controlling only the linear parameters was made with the aim of allowing the network training to be performed via fast (noniterative) and reliable least squares methods. In this case, the multiobjective problem is convex and the validation error is unimodal in the weighting parameter. The multiobjective framework used to write the SRM problem is useful to bring a clearer insight into the nature of the problem. All the machine learning techniques discussed here can be studied using this background. For instance, the formulation presented in [10] and [13] for MLPs is similar to the SVM formulation. In all cases, the tradeoff solutions are generated by constraining one of the two objectives defined here: the SVMs bound the error while minimizing the complexity, and the MLPs bound the complexity while minimizing the error.
The use of derivatives as a regularization term is already widespread; cubic splines, for instance, are based on second-order derivatives. In this work, a new insight into this usage was presented, as well as a novel efficient algorithm derived from the presented considerations. The results presented here show the good performance of the proposed approach. Moreover, they were generated without any interference of the designer, i.e., the whole learning process was conducted automatically. Future developments to be performed include finding efficient algorithms to control the nonlinear part of the network using the proposed measures. The error and complexity matrices are functions of the nonlinear parameters. Therefore, it can be possible to adjust these parameters in such a way as to decrease, simultaneously, both factors that are important to the machine generalization. The problem studied here can also be interpreted in terms of controlling the bias and the variance [68], in such a way that, by bounding the derivatives, some bias is inserted and the variance is reduced [13].

REFERENCES

[1] V. N. Vapnik, Statistical Learning Theory. New York: Wiley, 1998.
[2] C. Cortes and V. Vapnik, “Support vector networks,” Mach. Learn., vol. 20, pp. 273–279, 1995.
[3] N. Cristianini and J. Shawe-Taylor, An Introduction to Support Vector Machines (and Other Kernel-Based Learning Methods). Cambridge, U.K.: Cambridge Univ. Press, 2000.
[4] V. N. Vapnik, The Nature of Statistical Learning Theory (Statistics for Engineering and Information Science), 2nd ed. New York: Springer-Verlag, Sep. 2001.
[5] J. Shawe-Taylor and P. L. Bartlett, “Structural risk minimization over data-dependent hierarchies,” IEEE Trans. Inf. Theory, vol. 44, no. 5, pp. 1926–1940, Sep. 1998.
[6] G. E. Hinton, “Connectionist learning procedures,” Artif. Intell., vol. 40, pp. 185–234, 1989.
[7] A. N. Tikhonov, “On solving ill-posed problem and the method of regularization,” Doklady Akademii Nauk USSR, vol. 153, pp. 501–504, 1963.
[8] A. N. Tikhonov and V. Y. Arsenin, Solution of Ill-Posed Problems. Washington, DC: W. H. Winston, 1977.
[9] S. Haykin, Neural Networks: A Comprehensive Foundation, 2nd ed. Englewood Cliffs, NJ: Prentice-Hall, 1999.
[10] R. A. Teixeira, A. P. Braga, R. H. C. Takahashi, and R. R. Saldanha, “Improving generalization of MLPs with multi-objective optimization,” Neurocomputing, vol. 35, no. 1–4, pp. 189–194, 2000.
[11] M. A. Costa, A. P. Braga, B. R. Menezes, R. A. Teixeira, and G. G.
Parma, “Training neural networks with a multi-objective sliding mode control algorithm,” Neurocomputing, vol. 51, pp. 467–473, 2003. [12] W. M. Caminhas, D. A. G. Vieira, and J. A. Vasconcelos, “Parallel layer perceptron,” Neurocomputing, vol. 55, no. 3–4, pp. 771–778, Oct. 2003. [13] D. A. G. Vieira, W. M. Caminhas, and J. A. Vasconcelos, “Controlling the parallel layer perceptron complexity using a multiobjective learning algorithm,” Neural Comput. Appl., vol. 16, no. 4-5, May 2006, DOI: 10.1007/s00521-006-0052-z. [14] F. Girosi, M. Jones, and T. Poggio, “Regularization theory and neural networks architectures,” Neural Comput., vol. 7, pp. 219–269, 1995. [15] B. Schölkopf, S. Mika, C. J. C. Burges, P. Knirsch, K.-R. Müller, G. Rätsch, and A. J. Smola, “Input space vs. feature space in Kernel-based methods,” IEEE Trans. Neural Netw., vol. 10, no. 5, pp. 1000–1017, Sep. 1999. [16] M. J. Kearns and R. E. Schapire, “Efficient distribution-free learning of probabilistic concepts (abstract),” in Proc. 3rd Annu. Workshop Comput. Learn. Theory, 1990, pp. 382–391. [17] P. L. Bartlett, “The sample complexity of pattern classification with neural networks: The size of the weights is more important than the size of the network,” IEEE Trans. Inf. Theory, vol. 44, no. 2, pp. 525–536, Mar. 1998. [18] V. N. Vapnik, “Principles of structural risk minimization for learning theory,” in Advances in Neural Information Processing Systems. Cambridge, MA: MIT Press, 1992, vol. 4, pp. 831–838. [19] D. E. Rumelhart, G. E. Hinton, and R. J. Williams, Learning Internal Representations by Error Propagation, ser. Parallel Distributed Processing: Explorations in the Microstructure of Cognition, D. E. Rumelhart and J. L. McClelland, Eds. Cambridge, MA: Bradford Books (MIT Press), 1986, vol. 1. [20] G. Cybenko, “Approximation by superpositions of a Sigmoid function,” Math. Control Signals Syst., vol. 2, pp. 303–314, 1989. [21] K. 
Funahashi, “On the approximate realization of continuous mappings by neural networks,” Neural Netw., vol. 2, pp. 183–192, 1989.
[22] S. E. Fahlman, “Fast-learning variations on back-propagation: An empirical study,” in Proc. Connectionist Models Summer School, D. Touretzky, G. Hinton, and T. Sejnowski, Eds., San Mateo, CA, 1988, pp. 38–51. [23] M. T. Hagan and M. B. Menhaj, “Training feedforward networks with the Marquardt algorithm,” IEEE Trans. Neural Netw., vol. 5, no. 6, pp. 989–993, Nov. 1994. [24] P. L. Bartlett, “For valid generalization the size of the weights is more important than the size of the network,” in Advances in Neural Information Processing Systems. Cambridge, MA: MIT Press, 1997, vol. 9, pp. 134–141. [25] V. V. Vasin, “Relationship of several variational methods for approximate solutions of ill-posed problems,” Math. Notes, vol. 7, pp. 161–166, 1970. [26] D. S. Broomhead and D. Lowe, “Multivariable functional interpolation and adaptive networks,” Complex Syst., vol. 2, pp. 321–355, 1989. [27] J. Park and I. W. Sandberg, “Universal approximation using radial-basis-function networks,” Neural Comput., vol. 3, pp. 246–257, 1991. [28] A. R. Barron, “Universal approximation bounds for superpositions of a sigmoidal function,” IEEE Trans. Inf. Theory, vol. 39, no. 3, pp. 930–945, May 1993. [29] P. Andras, “The equivalence of support vector machine and regularization neural networks,” Neural Process. Lett., vol. 15, no. 2, pp. 97–104, 2002. [30] D. S. Yeung, W. W. Y. Ng, D. Wang, E. C. C. Tsang, and X.-Z. Wang, “Localized generalization error model and its application to architecture selection for radial basis function neural network,” IEEE Trans. Neural Netw., vol. 18, no. 5, pp. 1294–1305, Sep. 2007. [31] J. E. Moody and T. S. Rögnvaldsson, “Smoothing regularizers for projective basis function networks,” in Advances in Neural Information Processing Systems, M. C. Mozer, M. I. Jordan, and T. Petsche, Eds. Cambridge, MA: MIT Press, 1997, vol. 9, pp. 585–605. [32] H. Drucker and Y. LeCun, “Improving generalization performance using double back-propagation,” IEEE Trans.
Neural Netw., vol. 3, no. 6, pp. 991–997, Nov. 1992. [33] C. M. Bishop, “Curvature driven smoothing: A learning algorithm for feedforward networks,” IEEE Trans. Neural Netw., vol. 4, no. 5, pp. 882–884, Sep. 1993. [34] D. A. G. Vieira, W. M. Caminhas, and J. A. Vasconcelos, “Extracting sensitivity information of electromagnetic devices models from a modified ANFIS topology,” IEEE Trans. Magn., vol. 40, no. 2, pp. 1180–1183, Mar. 2004. [35] Matlab Toolboxes. The MathWorks, Natick, MA [Online]. Available: www.mathworks.com [36] K. Pelckmans, J. A. K. Suykens, and B. De Moor, “Convex approach to validation-based learning of the regularization constant,” IEEE Trans. Neural Netw., vol. 18, no. 3, pp. 917–920, May 2007. [37] D. Gabor, W. Wildes, and R. Woodcock, “A universal nonlinear filter, predictor and simulator which optimizes itself by a learning process,” Proc. Inst. Electr. Eng., vol. 108B, pp. 422–438, 1961. [38] Y. Xia and M. S. Kamel, “A generalized least absolute deviation method for parameter estimation of autoregressive signals,” IEEE Trans. Neural Netw., vol. 19, no. 1, pp. 107–118, Jan. 2008. [39] C. J. C. Burges and B. Schölkopf, “Improving the accuracy and speed of support vector machines,” in Advances in Neural Information Processing Systems, M. C. Mozer, M. I. Jordan, and T. Petsche, Eds. Cambridge, MA: MIT Press, 1997, vol. 9, p. 375. [40] IDA, “IDA Benchmark Repository Used in Several Boosting, KFD and SVM Papers,” Tech. Rep. [Online]. Available: http://ida.first.gmd.de/raetsch/data/benchmarks.htm [41] K.-R. Müller, S. Mika, G. Rätsch, K. Tsuda, and B. Schölkopf, “An introduction to kernel-based learning algorithms,” IEEE Trans. Neural Netw., vol. 12, no. 2, pp. 181–201, Mar. 2001. [42] J. Weston and R. Herbrich, “Adaptive margin support vector machines,” in Advances in Large Margin Classifiers, A. Smola, P. Bartlett, B. Schölkopf, and D. Schuurmans, Eds. Cambridge, MA: MIT Press, 2000, pp. 281–295. [43] G. C. Cawley and N. L. C.
Talbot, “Efficient leave-one-out cross-validation of kernel Fisher discriminant classifiers,” Pattern Recognit., vol. 36, pp. 2585–2592, November 2003. [44] V. Roth, “The generalized LASSO,” IEEE Trans. Neural Netw., vol. 15, no. 1, pp. 16–28, Jan. 2004. [45] G. C. Cawley and N. L. C. Talbot, “The evidence framework applied to sparse kernel logistic regression,” Neurocomputing, vol. 65, pp. 119–135, March 2005. [46] Q. Tao, G.-W. Wu, Fei-Yue, and J. Wang, “Posterior probability support vector machines for unbalanced data,” IEEE Trans. Neural Netw., vol. 16, no. 6, pp. 1561–1573, Nov. 2005.
[47] J. Demsar, "Statistical comparisons of classifiers over multiple data sets," J. Mach. Learn. Res., vol. 7, pp. 1–30, 2006.
[48] B. Zhang, "Is the maximal margin hyperplane special in a feature space?," Hewlett-Packard Labs, Palo Alto, CA, Tech. Rep., Apr. 2001.
[49] B. Schölkopf, C. Burges, and V. Vapnik, "Extracting support data for a given task," in Proc. 1st Int. Conf. Knowl. Disc. Data Mining, 1995, pp. 252–257.
[50] S. Amari and S. Wu, "Improving support vector machine classifiers by modifying kernel functions," Neural Netw., vol. 12, pp. 783–789, 1999.
[51] K. Duan, S. S. Keerthi, and A. N. Poo, "Evaluation of simple performance measures for tuning SVM hyperparameters," Neurocomputing, vol. 51, pp. 41–59, 2003.
[52] Y. Tan and J. Wang, "A support vector machine with a hybrid kernel and minimal Vapnik-Chervonenkis dimension," IEEE Trans. Knowl. Data Eng., vol. 16, no. 4, pp. 385–395, Apr. 2004.
[53] M. Yoon, Y. Yun, and H. Nakayama, "A role of total margin in support vector machines," in Proc. Int. Joint Conf. Neural Netw., 2003, pp. 2049–2053.
[54] G. Baudat and F. Anouar, "Kernel-based methods and function approximation," in Proc. Int. Joint Conf. Neural Netw., Washington, DC, 2001, pp. 1244–1249.
[55] E. Lacerda, A. Carvalho, A. P. Braga, and T. B. Ludermir, "Evolutionary radial basis functions for credit assessment," Appl. Intell., vol. 22, no. 3, pp. 167–182, 2005.
[56] S. P. Lloyd, "Least squares quantization in PCM," IEEE Trans. Inf. Theory, vol. IT-28, no. 2, pp. 129–137, Mar. 1982.
[57] M. A. Ismail, S. Z. Selim, and S. K. Arora, "Efficient clustering of multidimensional data," in Proc. IEEE Int. Conf. Syst. Man Cybern., 1984, pp. 120–123.
[58] R. O. Duda and P. E. Hart, Pattern Classification and Scene Analysis. New York: Wiley-Interscience, 1973.
[59] M. A. Ismail and M. S. Kamel, "Multidimensional data clustering utilizing hybrid strategies," Pattern Recognit., vol. 22, pp. 75–89, 1989.
[60] J. MacQueen, "Some methods for classification and analysis of multivariate observations," in Proc. 5th Berkeley Symp. Math. Statist. Probab., 1967, vol. 1, pp. 281–297.
[61] C. Chinrungrueng and C. H. Séquin, "Optimal adaptive k-means algorithm with dynamic adjustment of learning rate," IEEE Trans. Neural Netw., vol. 6, no. 1, pp. 157–169, Jan. 1995.
[62] D. E. Rumelhart, G. E. Hinton, and R. J. Williams, "Learning internal representations by error propagation," in Computational Models of Cognition and Perception. Cambridge, MA: MIT Press, 1986, vol. 1, ch. 8, pp. 319–362.
[63] S. E. Fahlman and C. Lebiere, "The cascade-correlation learning architecture," in Advances in Neural Information Processing Systems, D. S. Touretzky, Ed. San Mateo, CA: Morgan Kaufmann, 1990, vol. 2, pp. 524–532.
[64] R. Parekh, J. Yang, and V. Honavar, "Constructive neural network learning algorithms for multi-category real-valued pattern classification," Dept. Comput. Sci., Iowa State Univ., Ames, IA, Tech. Rep., 1987.
[65] B. Schölkopf, Support Vector Learning. New York: Springer-Verlag, 1997.
[66] J.-S. R. Jang, "ANFIS: Adaptive-network-based fuzzy inference systems," IEEE Trans. Syst. Man Cybern., vol. 23, no. 3, pp. 665–685, May 1993.
[67] T. Yamakawa, E. Uchino, T. Miki, and H. Kusanagi, "A neo fuzzy neuron and its applications to system identification and prediction of system behavior," in Proc. 2nd Int. Conf. Fuzzy Logic Neural Netw., Japan, 1992, pp. 477–483.
[68] S. Geman, E. Bienenstock, and R. Doursat, "Neural networks and the bias-variance dilemma," Neural Comput., vol. 4, no. 1, pp. 1–58, 1992.
D. A. G. Vieira (M'08) was born in Brazil in 1980. He received the B.Sc. and Ph.D. degrees in electrical engineering from the Universidade Federal de Minas Gerais (UFMG), Belo Horizonte, Brazil, in 2003 and 2006, respectively. In 2005, he was a Visiting Researcher at Oxford University, Oxford, U.K., and in 2007, an Associate Researcher at Imperial College London, London, U.K. His main interests are in multiobjective optimization theory and applications, machine learning, and their interfaces.
Ricardo H. C. Takahashi was born in Brazil in 1965. He received the B.Sc. and M.Sc. degrees in electrical engineering from Universidade Federal de Minas Gerais (UFMG), Belo Horizonte, Brazil, in 1989 and 1991, respectively, and the Ph.D. degree in electrical engineering from the University of Campinas (UNICAMP), Campinas, Brazil, in 1998. Currently, he is an Associate Professor at the Department of Mathematics, UFMG. His main research interests are in the fields of optimization theory (including the theory of evolutionary computation) and control theory.
Vasile Palade (M'02–SM'04) received the M.Eng. degree in computer engineering from the Technical University of Bucharest, Bucharest, Romania, in 1988 and the Ph.D. degree in computer engineering from the University of Galati, Galati, Romania, in 1998. He has been working at the Computing Laboratory, Oxford University, Oxford, U.K., since October 2001. Before joining Oxford University, he worked with the Department of Engineering, University of Hull, Hull, U.K., and with the Department of Computer Science and Engineering, University of Galati, Galati, Romania. His main research interests are in the area of computational intelligence, encompassing hybrid intelligent systems, neural networks, fuzzy and neurofuzzy systems, various nature-inspired algorithms (genetic algorithms, swarm optimization), and ensembles of classifiers. Application areas include a wide range of bioinformatics problems, fault diagnosis, web usage mining, and process modeling and control, among others.
J. A. Vasconcelos was born in Monte Carmelo, Brazil. He received the Ph.D. degree in electrical engineering from the Ecole Centrale de Lyon, Lyon, France, in 1984. Since 1985, he has been a Professor at the Electrical Engineering Department, Federal University of Minas Gerais, Belo Horizonte, Brazil. His research interests include linear and nonlinear optimization (evolutionary multiobjective optimization, stochastic and deterministic methods), computational intelligence, and computational electromagnetics (finite-element methods, boundary integral equation methods, and others). He is working on R&D projects in the application and development of new technologies for electrical systems.
W. M. Caminhas received the Ph.D. degree in electrical engineering from the University of Campinas (UNICAMP), Campinas, Brazil, in 1997. Currently, he is an Adjunct Professor at the Department of Electrical Engineering, Federal University of Minas Gerais, Belo Horizonte, Brazil. His research interests include computational intelligence and the control of electrical drives.