Building a Robust Extreme Learning Machine for Classification in the Presence of Outliers

Ana Luiza B. P. Barros (1,2) and Guilherme A. Barreto (2)

(1) Department of Computer Science, State University of Ceará, Campus of Itaperi, Fortaleza, Ceará, Brazil. [email protected]

(2) Department of Teleinformatics Engineering, Federal University of Ceará, Center of Technology, Campus of Pici, Fortaleza, Ceará, Brazil. [email protected]

Abstract. The Extreme Learning Machine (ELM), recently proposed by Huang et al. [6], is a single-hidden-layer neural network architecture which has been successfully applied to nonlinear regression and classification tasks [5]. A crucial step in the design of the ELM is the computation of the output weight matrix, a step usually performed by means of the ordinary least-squares (OLS) method, also known as the Moore-Penrose generalized inverse technique. However, it is well known that the OLS method produces predictive models which are highly sensitive to outliers in the data. In this paper, we develop an extension of the ELM which is robust to outliers caused by labelling errors. To deal with this problem, we suggest the use of M-estimators, a parameter estimation framework widely used in robust regression, to compute the output weight matrix instead of the standard OLS solution. The proposed model is robust to label noise not only near the class boundaries, but also far from the class boundaries, where errors can result from mistakes in labelling or gross errors in measuring the input features. We show the usefulness of the proposed classification approach through simulation results using synthetic and real-world data.

Keywords: Extreme Learning Machine, Moore-Penrose Generalized Inverse, Pattern Classification, Outliers, M-Estimation.

1 Introduction

In recent years, there has been an ever-increasing interest in a class of supervised single-hidden-layer neural network models, generically called Extreme Learning Machines (ELM), in which the input-to-hidden-layer weights are randomly chosen and the hidden-to-output-layer weights are determined analytically. Mainly due to its fast learning speed and ease of implementation [5], several authors have applied the standard ELM network (and sophisticated variants of it) to a number of complex pattern classification and regression problems [1, 4, 13–18]. It should be mentioned, however, that the aforementioned works have not consistently addressed the important issue of model performance in the presence of outliers, with the work of Horata et al. [4] being the only exception.

As a matter of fact, in recent years there has been a growing interest in the development of neural network architectures which are robust to outliers, including proposals for designing RBF networks [10, 11], echo-state networks [12] and even ELM networks [4]. It is worth emphasizing that all these previous works (no exception!) approached the issue of robustness to outliers for regression-like problems, such as function approximation and time series prediction. However, in many real-world pattern classification problems, the labels provided for the data samples are noisy. There are typically two kinds of label noise. Noise near the class boundaries often occurs because it is hard to consistently label ambiguous data points. Labelling errors far from the class boundaries can occur because of mistakes in labelling or gross errors in measuring the input features. Labelling errors far from the boundary comprise a particular category of outliers [9]. Thus, in order to allow ELM-based classifiers to handle labelling errors efficiently, in this paper we propose the use of M-estimators [8], a broad framework widely used for parameter estimation in robust regression problems, to compute the output weight matrix instead of using the ordinary least-squares solution. We show through simulations on synthetic and real-world data that the resulting ELM classifier is very robust to this type of outliers. To the best of our knowledge, this is the first time the performance of the ELM network as a pattern classifier is evaluated under the presence of outliers.

The remainder of the paper is organized as follows. In Section 2, we briefly review the fundamentals of the ELM in the context of pattern classification. Then, in Section 3 we describe the basic ideas and concepts behind the M-estimation framework and introduce our approach to robust supervised pattern classification using the ELM. In Section 4 we present the computer experiments we carried out using synthetic and real-world datasets and also discuss the achieved results. The paper is concluded in Section 5.

2 Fundamentals of the ELM

Let us assume that N data pairs {(x_µ, d_µ)}, µ = 1, ..., N, are available for building and evaluating the model, where x_µ ∈ R^{p+1} is the µ-th input pattern (its first component is set to 1 in order to include the bias) and d_µ ∈ R^K is the corresponding target class label, with K denoting the number of classes. For the labels, we assume a 1-of-K encoding scheme, i.e. for each label vector d_µ, the component whose index corresponds to the class of pattern x_µ is set to "+1", while the other K − 1 components are set to "−1". Then, let us randomly select N_1 (N_1 < N) data pairs from the available data pool and arrange them along the columns of the matrices X and D as follows:

    X = [x_1 | x_2 | · · · | x_{N_1}]   and   D = [d_1 | d_2 | · · · | d_{N_1}],          (1)

where dim(X) = (p + 1) × N_1 and dim(D) = K × N_1.
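To make the data layout of Eq. (1) concrete, the short sketch below builds X and D for a toy problem. The function name, the use of NumPy and the integer class indices are illustrative assumptions, not code from the original paper.

```python
import numpy as np

def build_design_matrices(inputs, labels, num_classes):
    """Arrange patterns and 1-of-K targets column-wise, as in Eq. (1).

    inputs : array of shape (N1, p) -- raw input patterns
    labels : array of shape (N1,)   -- integer class indices in {0, ..., K-1}
    Returns X of shape (p+1, N1) and D of shape (K, N1).
    """
    inputs = np.asarray(inputs, dtype=float)
    N1, p = inputs.shape
    # Prepend a constant 1 to each pattern to absorb the bias term.
    X = np.vstack([np.ones((1, N1)), inputs.T])             # (p+1) x N1
    # 1-of-K targets: +1 for the true class, -1 elsewhere.
    D = -np.ones((num_classes, N1))
    D[np.asarray(labels, dtype=int), np.arange(N1)] = 1.0   # K x N1
    return X, D
```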

The ELM is a single-hidden-layer feedforward network (SLFN), proposed by Huang et al. [6], for which the weights from the inputs to the hidden neurons are randomly chosen, while only the weights from the hidden neurons to the outputs are analytically determined. Consequently, the ELM offers significant advantages such as fast learning speed, ease of implementation and less human intervention when compared to more traditional SLFNs, such as the MLP and RBF networks.

For a network with p input units, q hidden neurons and C outputs, the i-th output at time step k is given by

    o_i(k) = β_i^T h(k),                                                          (2)

where β_i ∈ R^q, i = 1, ..., C, is the weight vector connecting the hidden neurons to the i-th output neuron, and h(k) ∈ R^q is the vector of hidden neurons' outputs for a given input pattern x(k) ∈ R^p. The vector h(k) itself is defined as

    h(k) = [f(w_1^T x(k) + b_1), ..., f(w_q^T x(k) + b_q)]^T,                     (3)

where b_l, l = 1, ..., q, is the bias of the l-th hidden neuron, w_l ∈ R^p is the weight vector of the l-th hidden neuron and f(·) is a sigmoidal activation function. Usually, the weight vectors w_l are randomly sampled from a uniform (or normal) distribution.

Let H = [h(1) h(2) · · · h(N)] be a q × N matrix whose N columns are the hidden-layer output vectors h(k) ∈ R^q, k = 1, ..., N, where N is the number of available training input patterns. Similarly, let D = [d(1) d(2) · · · d(N)] be a C × N matrix whose k-th column is the target (desired) vector d(k) ∈ R^C associated with the input pattern x(k), k = 1, ..., N. Finally, let β = [β_1 β_2 · · · β_C] be a q × C matrix whose i-th column is the weight vector β_i ∈ R^q, i = 1, ..., C. These three matrices are related by the following linear mapping:

    D = β^T H,                                                                    (4)

where the matrices D and H are known, while the weight matrix β is not. The OLS solution of the linear system in Eq. (4) is given by the Moore-Penrose generalized inverse as follows:

    β = (H H^T)^{-1} H D^T.                                                       (5)

Eq. (5) can be split into C individual estimation equations, one for each output neuron i, written as

    β_i = (H H^T)^{-1} H D_i^T,    i = 1, ..., C,                                 (6)

where D_i denotes the i-th row of matrix D. In several real-world problems the matrix H H^T can be singular, impairing the use of Eq. (5). In fact, a nearly singular (yet invertible) H H^T matrix is also a problem, because it can lead to numerically unstable results. To avoid both problems, a common approach is the ridge regression method (a.k.a. Tikhonov regularization), which is given by

    β_i = (H H^T + λI)^{-1} H D_i^T,    i = 1, ..., C,                            (7)

where the constant λ > 0 is the regularization parameter. As mentioned in the introduction, to the best of our knowledge a comprehensive approach to the robustness of the ELM to outliers in classification problems is still missing. Bearing this in mind, we propose the use of robust regression techniques to compute the output weight matrix, instead of the OLS approach. This approach is described in the next section.
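As a reference point for the robust variant introduced later, the following sketch implements the standard ELM training of Eqs. (2)-(7): random input-to-hidden weights, a sigmoidal hidden layer and the ridge-regularized least-squares solution for the output weights. Function and variable names are our own illustrative choices, not code from the original paper.

```python
import numpy as np

def elm_hidden_layer(X, W, b):
    """Hidden-layer matrix of Eq. (3): returns H (q x N) for inputs X of shape p x N."""
    return np.tanh(W @ X + b[:, None])           # hyperbolic tangent as the sigmoidal f(.)

def elm_train(X, D, q, lam=1e-2, rng=None):
    """Standard ELM: random hidden weights, ridge-regularized output weights (Eq. (7))."""
    rng = np.random.default_rng(rng)
    p = X.shape[0]
    W = rng.uniform(-0.1, 0.1, size=(q, p))      # random input-to-hidden weights
    b = rng.uniform(-0.1, 0.1, size=q)           # random hidden biases
    H = elm_hidden_layer(X, W, b)                # q x N
    # beta solves D = beta^T H in the regularized least-squares sense.
    beta = np.linalg.solve(H @ H.T + lam * np.eye(q), H @ D.T)   # q x C
    return W, b, beta

def elm_predict(X, W, b, beta):
    """Class decision: index of the largest output o_i = beta_i^T h (Eq. (2))."""
    O = beta.T @ elm_hidden_layer(X, W, b)       # C x N output matrix
    return np.argmax(O, axis=0)
```

The defaults above (λ = 10^{-2} and hidden weights drawn uniformly from (−0.1, +0.1)) mirror the settings reported for the experiments in Section 4.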

3 Basics of M-Estimation

An important feature of the OLS method is that it assigns the same importance to all error samples, i.e. all errors contribute in the same way to the final solution. A common approach to handle this problem consists in removing outliers from the data and then trying the usual least-squares fit. A more principled approach, known as robust regression, uses estimation methods that are not as sensitive to outliers as the OLS. Huber [7] introduced the concept of M-estimation, where M stands for "maximum likelihood" type, in which robustness is achieved by minimizing a function other than the sum of the squared errors. Based on Huber's theory, a general M-estimator applied to the i-th output neuron of the ELM classifier minimizes the following objective function:

    J(β_i) = Σ_{µ=1}^{N} ρ(e_{iµ}) = Σ_{µ=1}^{N} ρ(d_{iµ} − y_{iµ}) = Σ_{µ=1}^{N} ρ(d_{iµ} − β_i^T h_µ),          (8)

where the function ρ(·) computes the contribution of each error e_{iµ} = d_{iµ} − y_{iµ} to the objective function, d_{iµ} is the target value of the i-th output neuron for the µ-th input pattern, y_{iµ} = β_i^T h_µ is the corresponding network output, and β_i is the weight vector of the i-th output neuron. The OLS is a particular M-estimator, obtained when ρ(e_{iµ}) = e_{iµ}^2. It is desirable that the function ρ possesses the following properties:

Property 1: ρ(e_{iµ}) ≥ 0.
Property 2: ρ(0) = 0.
Property 3: ρ(e_{iµ}) = ρ(−e_{iµ}).
Property 4: ρ(e_{iµ}) ≥ ρ(e_{i′µ}) for |e_{iµ}| > |e_{i′µ}|.
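To illustrate why a bounded-growth ρ tames outliers, the sketch below compares the quadratic OLS loss with Huber's loss (quadratic near zero, linear in the tails); both satisfy Properties 1-4, but Huber's loss grows far more slowly for gross errors. The threshold value and the 1/2 factor follow one common convention and are illustrative assumptions, not taken from the paper.

```python
import numpy as np

def rho_ols(e):
    """Quadratic loss: the OLS special case of an M-estimator."""
    return e ** 2

def rho_huber(e, k=1.345):
    """Huber's loss: quadratic for |e| <= k, linear beyond k."""
    e = np.abs(e)
    return np.where(e <= k, 0.5 * e ** 2, k * (e - 0.5 * k))

errors = np.array([0.1, 1.0, 10.0])   # two typical errors and one gross (outlier) error
print(rho_ols(errors))                # approx. [0.01, 1.0, 100.0] -- the outlier dominates
print(rho_huber(errors))              # approx. [0.005, 0.5, 12.5] -- the outlier is damped
```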

Parameter estimation is defined by the estimating equations, which are obtained by setting the derivative of the objective function to zero. Let ψ = ρ′ be the derivative of ρ. Differentiating J(β_i) with respect to the estimated weight vector β̂_i and equating the result to zero, we have

    Σ_{µ=1}^{N} ψ(d_{iµ} − β̂_i^T h_µ) h_µ^T = 0,                                  (9)

where 0 is a q-dimensional row vector of zeros. Then, defining the weight function w(e_{iµ}) = ψ(e_{iµ})/e_{iµ} and letting w_{iµ} = w(e_{iµ}), the estimating equations can be written as

    Σ_{µ=1}^{N} w_{iµ} (d_{iµ} − β̂_i^T h_µ) h_µ^T = 0.                            (10)

Thus, solving the estimating equations corresponds to solving a weighted least-squares problem, minimizing Σ_µ w_{iµ}^2 e_{iµ}^2. It is worth noting, however, that the weights depend on the residuals (i.e. estimated errors), the residuals depend upon the estimated coefficients, and the estimated coefficients depend upon the weights. As a consequence, an iterative estimation method called iteratively reweighted least-squares (IRLS) [2] is commonly used. The steps of the IRLS algorithm in the context of training the ELM classifier, using Eq. (6) as reference, are described next.

IRLS Algorithm for ELM Training

Step 1 - Provide an initial estimate β̂_i(0) using the OLS solution in Eq. (6).
Step 2 - At each iteration t, compute the residuals from the previous iteration, e_{iµ}(t − 1), µ = 1, ..., N, associated with the i-th output neuron, and then compute the corresponding weights w_{iµ}(t − 1) = w[e_{iµ}(t − 1)].
Step 3 - Solve for the new weighted least-squares estimate of β_i(t):

    β̂_i(t) = (H W(t − 1) H^T)^{-1} H W(t − 1) D_i^T,                             (11)

where W(t − 1) = diag{w_{iµ}(t − 1)} is an N × N weight matrix. Repeat Steps 2 and 3 until the convergence of the estimated coefficient vector β̂_i(t).

Several weighting functions for M-estimators can be chosen, such as Huber's weighting function:

    w(e_{iµ}) = k / |e_{iµ}|,   if |e_{iµ}| > k,
                1,              otherwise,                                         (12)

where the parameter k is a tuning constant. Smaller values of k lead to more resistance to outliers, but at the expense of lower efficiency when the errors are normally distributed. In particular, k = 1.345σ for the Huber function, where σ is a robust estimate of the standard deviation of the errors (a usual approach is to take σ = MAR/0.6745, where MAR is the median absolute residual).

In sum, the basic idea of the proposed approach is very simple: replace the OLS estimation of the weight vector β̂_i of the i-th output neuron described in Eq. (6) with the one provided by the combined use of the M-estimation framework and the IRLS algorithm. From now on, we refer to the proposed approach as the Robust ELM classifier (or ROB-ELM, for short). In the next section we present and discuss the results achieved by the ROB-ELM classifier on synthetic and real-world datasets.
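A minimal sketch of the IRLS loop above follows, assuming the hidden-layer matrix H has already been computed with the standard ELM machinery (for instance, the hidden-layer helper sketched in Section 2) and using Huber's weighting function of Eq. (12). Function names, the convergence test and the MAR-based scale estimate follow the paper's description but are otherwise our own choices.

```python
import numpy as np

def huber_weights(e, k_factor=1.345):
    """Huber weighting of Eq. (12): w(e) = 1 for |e| <= k and k/|e| otherwise,
    with k = 1.345 * sigma and sigma = MAR / 0.6745 (median absolute residual)."""
    sigma = np.median(np.abs(e)) / 0.6745 + 1e-12
    k = k_factor * sigma
    abs_e = np.maximum(np.abs(e), 1e-12)
    return np.where(abs_e > k, k / abs_e, 1.0)

def rob_elm_output_weights(H, D, n_iter=20, tol=1e-6, weight_fn=huber_weights):
    """IRLS estimation of the q x C output weight matrix, one output neuron at a time.
    H: q x N hidden-layer matrix, D: C x N target matrix (see Section 2)."""
    q, N = H.shape
    C = D.shape[0]
    # Step 1: initial estimate via the OLS solution of Eq. (6).
    beta = np.linalg.solve(H @ H.T, H @ D.T)                  # q x C
    for i in range(C):
        b_i = beta[:, i]
        for _ in range(n_iter):
            e = D[i] - b_i @ H                                # Step 2: residuals ...
            w = weight_fn(e)                                  # ... and their weights
            HW = H * w                                        # columns of H scaled by w, i.e. H W
            b_new = np.linalg.solve(HW @ H.T, HW @ D[i])      # Step 3: Eq. (11)
            if np.linalg.norm(b_new - b_i) <= tol * (np.linalg.norm(b_i) + 1e-12):
                b_i = b_new
                break
            b_i = b_new
        beta[:, i] = b_i
    return beta
```

Other weighting functions used in Section 4 (bisquare, fair, logistic) can be swapped in through the weight_fn argument.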

4 Simulations and Discussion

As a proof of concept, in the first experiment we aim at showing the influence of outliers on the final position of the decision curve between two nonlinearly separable data classes. For this purpose, let us consider a synthetic two-dimensional dataset generated according to a pattern of two intertwining moons (see Fig. 1). The ELM and the ROB-ELM classifiers are trained twice. The first time they are trained with the outlier-free dataset with N = 120 samples. The second time, they are trained with N_out outliers added to the original dataset. It is worth mentioning that all data samples are used for training the classifiers, since the goal is to visualize the final position of the decision line and not to compute recognition rates. For this experiment, the Andrews weighting function was used for implementing the ROB-ELM classifier and the regularization constant required for implementing the standard ELM classifier was set to λ = 10^{-2}. Three hidden neurons with hyperbolic tangent activation functions were used for both classifiers. For the sake of fairness, the ELM and the ROB-ELM classifiers used the same input-to-hidden-layer weights, which were randomly sampled from a uniform distribution over (−0.1, +0.1). The default tuning parameter k of Matlab's robustfit function was used.

In order to evaluate the final decision curves of the ELM and the ROB-ELM classifiers in the presence of outliers, we added N_out = 10 outliers to the dataset and labelled them as belonging to class +1. The outliers were located purposefully far from the class boundary found for the outlier-free case; more specifically, inside the decision region of class −1. The results for the training without outliers are shown in Fig. 1a, where, as expected, the decision curves of both classifiers are similar. The results for the training with outliers are shown in Fig. 1b, where this time the decision curve of the standard ELM classifier moved (bent) towards the outliers, while the decision line of the ROB-ELM classifier remained unchanged, thus revealing the robustness of the proposed approach to outliers. The dataset (with and without outliers) used in the first experiment is available from the authors upon request.

In the second and third experiments we aim at evaluating the robustness of the ROB-ELM classifier using a real-world dataset. For these experiments, four weighting functions (Bisquare, Fair, Huber and Logistic) were tested for implementing the ROB-ELM classifier and the regularization constant required for implementing the standard ELM classifier was set to λ = 10^{-2}. The default tuning parameter k of Matlab's robustfit function was adopted for all weighting functions. In order to evaluate the classifier's robustness to outliers we follow the methodology introduced by Kim and Ghahramani [9]: the original labels of some data samples of a given class are deliberately changed to the label of the other class. A benchmark dataset was chosen (Ionosphere), which is publicly available for download from the UCI Machine Learning Repository [3]. The Ionosphere dataset describes a binary classification task where radar signals target two types of electrons in the ionosphere. "Good" radar returns are those showing evidence of some type of structure in the ionosphere. "Bad" returns are those that do not, since their signals pass through the ionosphere. This dataset is comprised of 351 34-dimensional data points and two classes (good and bad).

[Fig. 1. Decision curves of the standard ELM and the proposed robust ELM (ROB-ELM) classifiers on the two-moons data (axes: Feature X1 vs. Feature X2; legend: Class +1, Class −1, ROB-ELM, Standard ELM). (a) Dataset without outliers. (b) Dataset with outliers.]

We labelled the data points of class Good (N_g = 225 samples) and class Bad (N_b = 126 samples) as +1 and −1, respectively. 50% of the available samples are randomly selected for training purposes. In addition, outliers are built by randomly selecting a certain percentage P_out of the training samples from class +1 and changing their labels to class −1.

Table 1. Performance comparison of the ELM and ROB-ELM classifiers for the Ionosphere dataset (P_out = 5%).

  q    ELM (λ = 0.01)   ROB-ELM (Bisquare)   ROB-ELM (Fair)   ROB-ELM (Huber)   ROB-ELM (Logistic)
  10   70.94±4.95       69.77±6.72           72.21±4.66       72.30±4.27        72.23±4.05
  15   70.67±4.53       70.76±5.54           73.41±3.34       73.11±3.40        73.56±3.75
  20   70.93±3.90       72.29±5.87           74.00±3.49       74.07±3.39        74.44±3.40
  25   71.24±3.33       72.89±4.86           73.94±2.94       74.51±3.35        74.41±2.87
  30   71.90±3.16       72.06±4.54           75.50±2.56       75.81±2.19        75.30±2.41
  35   71.81±3.01       73.90±3.49           76.36±1.75       76.13±1.51        76.27±1.62
  40   72.73±2.53       73.87±3.35           76.20±2.10       76.69±1.96        76.60±2.03
  45   73.34±2.25       73.46±4.05           76.69±1.83       76.83±1.86        76.93±1.89
  50   73.27±2.33       73.71±4.36           77.21±2.02       76.80±2.07        76.76±2.18
  100  73.97±2.03       71.49±5.42           77.03±3.68       76.46±4.32        77.03±4.17

Table 2. Performance comparison of the ELM and ROB-ELM classifiers for the Ionosphere dataset (P_out = 10%).

  q    ELM (λ = 0.01)   ROB-ELM (Bisquare)   ROB-ELM (Fair)   ROB-ELM (Huber)   ROB-ELM (Logistic)
  10   67.03±5.12       69.77±4.80           71.06±4.52       70.57±4.77        70.47±4.03
  15   66.39±5.35       70.61±4.94           70.73±4.79       68.79±5.03        70.63±4.73
  20   65.57±5.71       72.04±4.77           70.86±4.06       70.56±4.76        70.90±3.98
  25   65.93±4.50       73.37±4.25           72.03±3.19       71.64±3.81        71.49±4.08
  30   65.83±4.07       72.17±3.69           73.17±3.09       72.40±3.48        73.04±3.08
  35   66.51±3.82       73.46±3.26           73.89±2.79       73.17±2.71        73.81±2.68
  40   67.17±2.78       74.56±3.01           74.86±2.94       75.07±2.78        74.60±2.90
  45   66.96±3.00       75.23±3.73           75.23±3.22       74.67±2.95        75.63±2.72
  50   67.27±2.57       75.33±3.66           75.67±2.79       75.66±3.14        75.97±2.74
  100  68.81±2.37       72.99±6.02           77.07±4.41       76.90±4.19        77.01±4.50

Table 3. Performance comparison of the ELM and ROB-ELM classifiers for the Ionosphere dataset (P_out = 20%).

  q    ELM (λ = 0.01)   ROB-ELM (Bisquare)   ROB-ELM (Fair)   ROB-ELM (Huber)   ROB-ELM (Logistic)
  10   50.03±5.19       50.93±5.22           55.16±6.95       50.51±5.82        53.20±6.93
  15   50.07±4.49       51.43±5.53           55.20±4.86       50.79±4.64        53.87±4.99
  20   48.63±3.63       52.39±4.99           55.99±6.00       50.51±4.33        54.74±5.05
  25   49.83±4.32       52.31±4.25           56.46±5.64       50.90±4.41        54.11±4.49
  30   49.49±3.35       53.21±3.47           56.93±4.38       51.26±4.00        55.26±3.94
  35   49.90±3.22       54.76±4.39           58.71±4.90       52.84±4.37        56.69±4.62
  40   49.24±2.79       57.53±5.80           61.11±6.28       56.37±5.44        59.03±6.09
  45   49.93±3.03       58.41±6.83           62.76±6.92       58.69±6.73        61.04±5.87
  50   50.16±3.02       61.93±6.68           64.51±7.83       60.76±7.40        63.61±6.82
  100  51.43±3.18       68.94±7.89           69.91±6.97       67.03±9.39        68.39±7.67

The testing set is outlier-free, since the goal of the experiment is to evaluate the influence of outliers on the construction of the decision borders of the classifiers. For this purpose, we evaluate the performances of the ELM and ROB-ELM classifiers for P_out = 5%, 10% and 20% and for different values of q (the number of hidden neurons). The results are given in Tables 1, 2 and 3. In these tables, we show the classification rates and the corresponding standard deviations averaged over 100 training/testing runs.
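The outlier-injection protocol just described (flip the labels of a fraction P_out of the class +1 training samples to −1, keep the test set clean) can be sketched as follows. The function name and the use of NumPy are illustrative assumptions, and P_out is interpreted here as a fraction of the class +1 training samples.

```python
import numpy as np

def inject_label_outliers(d_train, p_out, rng=None):
    """Flip a fraction p_out of the class +1 training labels to class -1.

    d_train : 1-D array of training labels in {+1, -1}
    p_out   : fraction of class +1 training samples to corrupt (e.g. 0.05, 0.10, 0.20)
    Returns a copy of the labels with the selected entries flipped."""
    rng = np.random.default_rng(rng)
    d_noisy = d_train.copy()
    pos_idx = np.flatnonzero(d_train == 1)                 # candidates: class +1 samples
    n_flip = int(round(p_out * pos_idx.size))
    flip_idx = rng.choice(pos_idx, size=n_flip, replace=False)
    d_noisy[flip_idx] = -1                                 # mislabel them as class -1
    return d_noisy
```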

By analyzing the results, we can first verify that the performances of all variants of the proposed ROB-ELM classifier tend to improve with an increase in the number of hidden neurons. Secondly, the performances deteriorate with an increase in the number of outliers, as expected. As a major result, one can easily verify that the performance of the ROB-ELM classifier is better than that of the standard ELM, especially when using the Fair, Huber and Logistic weighting functions. While the improvements in the performance of the ROB-ELM classifier are larger for higher values of q (the number of hidden neurons), there are no significant improvements in the standard ELM classifier when q increases, especially for P_out = 10% and 20%.

As a final remark, it is worth mentioning once again that the excellent performance of the proposed ROB-ELM classifier was achieved using the default values of the tuning parameter k of Matlab's robustfit function for all weighting functions used in this paper. This is particularly interesting for the practitioner who wants to obtain fast and accurate results without spending much time on long fine-tuning runs of the classifier.

5 Conclusion

In this paper we introduced a robust ELM classifier (ROB-ELM) for supervised pattern classification in the presence of labelling errors (outliers) in the data. The ROB-ELM classifier was designed by means of M-estimation methods, which are used to compute the output weight matrix instead of the ordinary least-squares solution. By means of computer simulations on synthetic and real-world datasets we have shown that the resulting classifier is more robust to outliers than the standard ELM classifier. Currently, we are further evaluating the performance of the ROB-ELM on other binary classification datasets and also on multiclass problems. The results we have obtained so far suggest that this is a promising approach.

References

1. Deng, W., Zheng, Q., Chen, L.: Regularized extreme learning machine. In: Proceedings of the IEEE Symposium on Computational Intelligence and Data Mining (CIDM'09), pp. 389–395 (2009)
2. Fox, J.: Applied Regression Analysis, Linear Models, and Related Methods. Sage Publications (1997)
3. Frank, A., Asuncion, A.: UCI Machine Learning Repository (2010), http://archive.ics.uci.edu/ml
4. Horata, P., Chiewchanwattana, S., Sunat, K.: Robust extreme learning machine. Neurocomputing 102, 31–44 (2012)
5. Huang, G.B., Wang, D.H., Lan, Y.: Extreme learning machines: a survey. International Journal of Machine Learning and Cybernetics 2, 107–122 (2011)
6. Huang, G.B., Zhu, Q.Y., Siew, C.K.: Extreme learning machine: theory and applications. Neurocomputing 70, 489–501 (2006)
7. Huber, P.J.: Robust estimation of a location parameter. Annals of Mathematical Statistics 35(1), 73–101 (1964)
8. Huber, P.J., Ronchetti, E.M.: Robust Statistics. John Wiley & Sons (2009)
9. Kim, H.C., Ghahramani, Z.: Outlier robust Gaussian process classification. In: Proceedings of the 2008 Joint IAPR International Workshop on Structural, Syntactic, and Statistical Pattern Recognition (SSPR'08), pp. 896–905 (2008)
10. Lee, C.C., Chiang, Y.C., Shih, C.Y., Tsai, C.L.: Noisy time series prediction using M-estimator based robust radial basis function neural networks with growing and pruning techniques. Expert Systems with Applications 36(3), 4717–4724 (2009)
11. Lee, C.C., Chung, P.C., Tsai, J.R., Chang, C.I.: Robust radial basis function neural networks. IEEE Transactions on Systems, Man, and Cybernetics - Part B 29(6), 674–685 (1999)
12. Li, D., Han, M., Wang, J.: Chaotic time series prediction based on a novel robust echo state network. IEEE Transactions on Neural Networks and Learning Systems 23(5), 787–799 (2012)
13. Liu, N., Wang, H.: Ensemble based extreme learning machine. IEEE Signal Processing Letters 17(8), 754–757 (2010)
14. Miche, Y., Sorjamaa, A., Bas, P., Simula, O., Jutten, C., Lendasse, A.: OP-ELM: optimally pruned extreme learning machine. IEEE Transactions on Neural Networks 21(1), 158–162 (2010)
15. Miche, Y., van Heeswijk, M., Bas, P., Simula, O., Lendasse, A.: TROP-ELM: a double-regularized ELM using LARS and Tikhonov regularization. Neurocomputing 74(16), 2413–2421 (2011)
16. Mohammed, A., Minhas, R., Jonathan Wu, Q.M., Sid-Ahmed, M.A.: Human face recognition based on multidimensional PCA and extreme learning machine. Pattern Recognition 44(10–11), 2588–2597 (2011)
17. Neumann, K., Steil, J.: Optimizing extreme learning machines via ridge regression and batch intrinsic plasticity. Neurocomputing 102, 23–30 (2013)
18. Zong, W., Huang, G.B.: Face recognition based on extreme learning machine. Neurocomputing 74(16), 2541–2551 (2011)
