Extreme Large Margin Distribution Machine and Its Applications for ...


Extreme Large Margin Distribution Machine and Its Applications for Biomedical Datasets Zhiyong Yang, Jingcheng Lu, Taohong Zhang University of Science and Technology Beijing

Presented by Zhiyong Yang December 18th 2016



Outline

1 Introduction
2 A brief introduction to ELM
3 Proposed methods
4 Empirical study
5 Future works
6 References







Introduction

Classification methods have become increasingly popular for biomedical and bioinformatics data analysis [9, 6, 8, 11, 7]. However, because data acquisition is difficult, sometimes only small-scale datasets can be obtained, and models trained on them may generalize poorly due to overfitting. Large margin theory offers a way out of this dilemma. This paper aims to provide a large margin formulation for the Extreme Learning Machine (ELM), an efficient training algorithm for single-hidden-layer feedforward neural networks.



Our contributions

- A novel algorithm called Extreme Large Margin Distribution Machine (ELDM) is proposed by bridging the Large margin Distribution Machine (LDM) [13] and ELM [5].
- The global optimality of the ELDM solution is then proved.
- An efficient multi-class extension of ELDM is given.
- For general purposes, we evaluate the proposed algorithm on 7 benchmark datasets. For biomedical applications, we evaluate it on 5 small disease diagnosis datasets.







Illustration

Figure: An illustration of ELM



Basic ELM I

1. Given a training set $D = \{X_i, Y_i\}_{i=1}^{m}$, where $X_i \in \mathbb{R}^{d}$ is the $i$th input vector and $Y_i \in \mathbb{R}^{n_c}$ is the $i$th target vector, $i = 1, 2, \cdots, m$, and a piecewise continuous activation function $g(\cdot)$.

2. Randomly generate the input layer weight matrix $W \in \mathbb{R}^{d \times n_h}$ and the hidden layer bias $b \in \mathbb{R}^{n_h}$, where $n_h$ is the number of hidden layer neurons.

3. Calculate the hidden layer features

\[
H = \begin{bmatrix} h(X_1) \\ \vdots \\ h(X_m) \end{bmatrix}
  = \begin{bmatrix}
      h_1(X_1) & \cdots & h_{n_h}(X_1) \\
      \vdots   & \cdots & \vdots \\
      h_1(X_m) & \cdots & h_{n_h}(X_m)
    \end{bmatrix}
\tag{1}
\]

where $h_i(X_j) = g\big(W^{(i)\top} X_j + b_i\big)$ and $W^{(i)}$ is the $i$th column of $W$.



Basic ELM II

4. Estimate the output layer weight $\beta$ by the following equation:

\[
\hat{\beta} = H^{\dagger} Y = \Big(H^\top H + \frac{I}{C}\Big)^{-1} H^\top Y
\tag{2}
\]

where $Y$ is the target vector $(Y_1, \cdots, Y_m)^\top$.
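The four steps above can be sketched in a few lines of NumPy. This is only an illustrative sketch, not the authors' implementation: the standard normal draw for $W$ and $b$, the helper names `elm_fit` and `elm_predict`, and the toy two-blob data are all assumptions, and tanh is used as $g$ because it is the activation named in the experiment settings.

```python
import numpy as np

def elm_fit(X, Y, n_hidden, C, rng):
    """Train a basic ELM: random hidden layer, ridge-regularized output weights."""
    d = X.shape[1]
    # Step 2: random input weights W (d x n_h) and biases b (n_h,).
    # The standard normal distribution is an assumption of this sketch.
    W = rng.standard_normal((d, n_hidden))
    b = rng.standard_normal(n_hidden)
    # Step 3: hidden-layer feature matrix H (m x n_h), Eq. (1), with g = tanh.
    H = np.tanh(X @ W + b)
    # Step 4: beta = (H^T H + I/C)^{-1} H^T Y, Eq. (2).
    beta = np.linalg.solve(H.T @ H + np.eye(n_hidden) / C, H.T @ Y)
    return W, b, beta

def elm_predict(X, W, b, beta):
    """Network output h(X)^T beta for each row of X."""
    return np.tanh(X @ W + b) @ beta

# Hypothetical toy data: two Gaussian blobs with labels in {-1, +1}.
rng = np.random.default_rng(0)
X = np.vstack([rng.normal(-2.0, 1.0, (40, 2)), rng.normal(2.0, 1.0, (40, 2))])
Y = np.concatenate([-np.ones(40), np.ones(40)])

W, b, beta = elm_fit(X, Y, n_hidden=20, C=1.0, rng=rng)
train_acc = np.mean(np.sign(elm_predict(X, W, b, beta)) == Y)
```

Solving the linear system with `np.linalg.solve` avoids forming the inverse explicitly, which gives the same $\hat{\beta}$ as Eq. (2) with better numerical behaviour.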



Basic ELM III

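From Eq. (2), $\hat{\beta}$ can also be regarded as the solution of the following least squares problem:

Problem P

\[
(P): \quad \operatorname*{argmin}_{\beta} \;
\frac{1}{2}\operatorname{tr}(\beta^\top \beta)
+ \frac{1}{2} C \operatorname{tr}\big((H\beta - Y)^\top (H\beta - Y)\big)
\tag{3}
\]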



Advantages of ELM

Advantages
- Iteration-free
- Easy to implement
- Efficient
- Theoretical supports [5, 4, 3]



A problem with ELM

ELM forces all the functional margins $y_i h(x_i)^\top \beta$ to be close to 1 via the square loss:

\[
(H\beta - Y)^\top (H\beta - Y)
= \sum_{i=1}^{m} \big(y_i - h(x_i)^\top \beta\big)^2
= \sum_{i=1}^{m} \big(1 - y_i h(x_i)^\top \beta\big)^2
\]

How can we push the instances further away from $h^\top \beta = 1$ or $h^\top \beta = -1$?
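The second equality in the square-loss expansion above holds only for binary labels. A short derivation, assuming $y_i \in \{-1, +1\}$ (so that $y_i^2 = 1$), which the slide leaves implicit:

```latex
\begin{align*}
y_i - h(x_i)^\top\beta
  &= y_i\big(1 - y_i\,h(x_i)^\top\beta\big)
  && \text{because } y_i^2 = 1, \\
\big(y_i - h(x_i)^\top\beta\big)^2
  &= y_i^2\,\big(1 - y_i\,h(x_i)^\top\beta\big)^2
   = \big(1 - y_i\,h(x_i)^\top\beta\big)^2 .
\end{align*}
```

So the square loss penalizes any functional margin that differs from 1, including margins larger than 1.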







The margin distribution on the training dataset

In order to maximize the functional margins on average, we could optimize their distribution ([1], [10]). The mean and variance of the functional margin on $D$ are:

mean

\[
\gamma_1 = \frac{1}{m}\sum_{i=1}^{m} y_i \beta^\top h(X_i) = \frac{1}{m}(H^\top Y)^\top \beta
\tag{4}
\]

variance

\[
\gamma_2 = \frac{1}{m^2}\sum_{i=1}^{m}\sum_{j=1}^{m} \big(y_i \beta^\top h(X_i) - y_j \beta^\top h(X_j)\big)^2
= \frac{2}{m}\beta^\top H^\top H \beta - \frac{2}{m^2}\beta^\top H^\top Y Y^\top H \beta
\tag{5}
\]
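The matrix forms of the margin mean and variance can be checked numerically against their sum definitions. The sketch below assumes binary labels in {-1, +1} and uses random stand-ins for $H$ and $\beta$; the function name `margin_stats` is hypothetical.

```python
import numpy as np

def margin_stats(H, Y, beta):
    """Mean (Eq. 4) and pairwise variance (Eq. 5) of the functional margins."""
    m = H.shape[0]
    h = H.T @ Y                               # h = H^T Y
    gamma1 = (h @ beta) / m
    # beta^T H^T Y Y^T H beta equals (h . beta)^2.
    gamma2 = (2.0 / m) * (beta @ (H.T @ H) @ beta) - (2.0 / m**2) * (h @ beta) ** 2
    return gamma1, gamma2

# Random stand-ins for the hidden features, labels and output weights.
rng = np.random.default_rng(0)
m, n_h = 30, 5
H = np.tanh(rng.standard_normal((m, n_h)))
Y = rng.choice([-1.0, 1.0], size=m)
beta = rng.standard_normal(n_h)

gamma1, gamma2 = margin_stats(H, Y, beta)

# Direct evaluation from the definitions, for comparison.
margins = Y * (H @ beta)
gamma1_direct = margins.mean()
gamma2_direct = ((margins[:, None] - margins[None, :]) ** 2).sum() / m**2
```

The matrix form costs $O(m n_h^2)$, whereas the double sum costs $O(m^2)$ once the margins are known, so the two differ mainly in how easily they can be differentiated with respect to $\beta$.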



ELDM & LDM

Formulation of LDM [13, 14, 12]: merge $\gamma_1$ and $\gamma_2$ into the objective function of SVM.
Formulation of ELDM (ours): merge $\gamma_1$ and $\gamma_2$ into the objective function of ELM.
Although the two formulations look similar, the algorithms solve them in completely different ways.



Formulation of ELDM I

1. A good distribution of functional margins must have a large mean and a small variance.

2. Add $\lambda_2 \gamma_2 - \lambda_1 \gamma_1$ to the original objective function of Problem P; the resulting problem is denoted Problem Q throughout this paper. Our proposed algorithm, Extreme Large Margin Distribution Machine, then gives the closed-form solution of Problem Q.



Formulation of ELDM II

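Problem Q

\[
(Q): \quad \text{minimize} \quad
\underbrace{\frac{1}{2}\beta^\top \beta}_{l_1}
+ \underbrace{\frac{1}{2} C (H\beta - Y)^\top (H\beta - Y)}_{l_2}
+ \underbrace{\lambda_2 \gamma_2 - \lambda_1 \gamma_1}_{l_3}
\tag{6}
\]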



Formulation of ELDM III

Explanations
- $l_1$: a regularizer that penalizes high model complexity
- $l_2$: a weak constraint that requires all the functional margins to be close to one and thus positive
- $l_3$: the objective for an optimal margin distribution



Solution of ELDM I

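Theorem
A global minimum solution of Problem Q can be represented as

\[
\beta^* = \Big(\frac{I}{C} + k_1 H^\top H - \frac{k_2}{m} h h^\top\Big)^{-1} (k_3 h)
\tag{7}
\]

where $k_1 = \frac{4\lambda_2}{Cm} + 1$, $k_2 = \frac{4\lambda_2}{Cm}$, $k_3 = \frac{\lambda_1}{Cm} + 1$ and $h = H^\top Y$.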
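As a rough illustration of the theorem, the sketch below evaluates Eq. (7) with NumPy and compares the Problem Q objective at $\beta^*$ with its value at randomly perturbed points. The function names `eldm_fit` and `problem_q`, the random data and the hyper-parameter values are assumptions made for this example only.

```python
import numpy as np

def eldm_fit(H, Y, C, lam1, lam2):
    """Closed-form ELDM output weights following Eq. (7)."""
    m, n_h = H.shape
    k1 = 4.0 * lam2 / (C * m) + 1.0
    k2 = 4.0 * lam2 / (C * m)
    k3 = lam1 / (C * m) + 1.0
    h = H.T @ Y
    A = np.eye(n_h) / C + k1 * (H.T @ H) - (k2 / m) * np.outer(h, h)
    return np.linalg.solve(A, k3 * h)

def problem_q(beta, H, Y, C, lam1, lam2):
    """Objective of Problem Q: l1 + l2 + l3."""
    m = H.shape[0]
    h = H.T @ Y
    r = H @ beta - Y
    gamma1 = (h @ beta) / m
    gamma2 = (2.0 / m) * (beta @ (H.T @ H) @ beta) - (2.0 / m**2) * (h @ beta) ** 2
    return 0.5 * (beta @ beta) + 0.5 * C * (r @ r) + lam2 * gamma2 - lam1 * gamma1

# Random stand-ins for hidden features and binary labels.
rng = np.random.default_rng(0)
m, n_h = 40, 8
H = np.tanh(rng.standard_normal((m, n_h)))
Y = rng.choice([-1.0, 1.0], size=m)
C, lam1, lam2 = 2.0, 2.0**-3, 2.0**-4

beta_star = eldm_fit(H, Y, C, lam1, lam2)
q_star = problem_q(beta_star, H, Y, C, lam1, lam2)
# Objective at randomly perturbed points; a global minimum implies none is lower.
q_perturbed = [
    problem_q(beta_star + 0.05 * rng.standard_normal(n_h), H, Y, C, lam1, lam2)
    for _ in range(20)
]
```

A perturbation test of this kind is only a sanity check; the proof of global optimality is the Hessian argument given on the following slides.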



Solution of ELDM II

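Proof
The objective function of Problem Q can be rearranged as follows:

\[
L = \frac{1}{2C}\beta^\top \beta + \frac{1}{2}(H\beta - Y)^\top (H\beta - Y) + \frac{\lambda_2}{C}\gamma_2 - \frac{\lambda_1}{C}\gamma_1
\tag{8}
\]

Setting the partial derivative of $L$ with respect to $\beta$ to zero, we have

\[
\Big(\frac{I}{C} + \Big(\frac{4\lambda_2}{Cm} + 1\Big) H^\top H - \frac{4\lambda_2}{Cm^2} h h^\top\Big)\beta - H^\top Y - \frac{\lambda_1}{Cm} H^\top Y = 0
\tag{9}
\]

Solving Eq. (9) for $\beta$ gives an optimal solution $\beta^*$ of Q.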



Solution of ELDM III

Proof

\[
\beta^* = \Big(\frac{I}{C} + k_1 H^\top H - \frac{k_2}{m} h h^\top\Big)^{-1} (k_3 h)
\tag{10}
\]

Since Problem Q is an unconstrained quadratic program with respect to $\beta$, to prove the global optimality of $\beta^*$ we only need to prove that the Hessian matrix of $L$, denoted $He$, is positive semi-definite. We have

\[
\begin{aligned}
He &= \frac{I}{C} + H^\top H + (k_1 - 1) H^\top H - \frac{1}{m}(k_1 - 1) H^\top Y Y^\top H \\
   &= \frac{I}{C} + H^\top H + (k_1 - 1) H^\top \Big(I - \frac{1}{m} Y Y^\top\Big) H
\end{aligned}
\tag{11}
\]



Solution of ELDM IV

Proof
Obviously, the first two terms of $He$ are positive semi-definite. For the last term, it is easy to notice that $I - \frac{1}{m} Y Y^\top$ is a symmetric idempotent matrix. As a result, the last term of $He$ can be expressed as

\[
(k_1 - 1) H^\top \Big(I - \frac{1}{m} Y Y^\top\Big) H = K^\top K
\tag{12}
\]

where $K = \sqrt{k_1 - 1}\,\big(I - \frac{1}{m} Y Y^\top\big) H$. Then by Eq. (12), $He$ must be positive semi-definite.
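The idempotence used in this step can be verified directly. The derivation below is a sketch that assumes binary labels $Y \in \{-1, +1\}^m$, so that $Y^\top Y = m$, and writes $P = I - \frac{1}{m} Y Y^\top$ as shorthand (a symbol introduced here, not in the slides):

```latex
\begin{align*}
P^2 &= I - \tfrac{2}{m} Y Y^\top + \tfrac{1}{m^2}\, Y\,(Y^\top Y)\,Y^\top
     = I - \tfrac{2}{m} Y Y^\top + \tfrac{1}{m} Y Y^\top
     = P, \\
x^\top H^\top P H x &= x^\top H^\top P^\top P H x
     = \lVert P H x \rVert^2 \ge 0
     \quad \text{for every } x \in \mathbb{R}^{n_h}.
\end{align*}
```

Since $k_1 - 1 = \frac{4\lambda_2}{Cm} \ge 0$ for non-negative $\lambda_2$, the last term of the Hessian is positive semi-definite, and the $\frac{I}{C}$ term makes the whole Hessian positive definite.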



Solution of ELDM V

When $n_h < 2m$, computing the solution given in the theorem above takes $O(8 n_h^3 + 2 m^2 n_h)$. When $n_h > 2m$, which is exactly the case for small datasets, the solution can be further simplified as follows:

\[
\beta^* = C k_3 h - C k_3 \tilde{H}^\top \Big(\frac{I}{C} + \tilde{H}\tilde{H}^\top\Big)^{-1} \tilde{H} h
\tag{13}
\]

where $\tilde{H} = [H^\top \; K^\top]^\top$. According to Eq. (13), the computational complexity of the inversion is $O(8 m^3 + 2 n_h^2 m)$, which is much cheaper than Eq. (7) for small datasets.
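The equivalence between the $n_h \times n_h$ form of Eq. (7) and the $2m \times 2m$ form of Eq. (13) can be checked numerically. The sketch below uses random stand-in data with $n_h > 2m$ and hypothetical hyper-parameter values; $K$ is built as $\sqrt{k_1 - 1}\,(I - \frac{1}{m} Y Y^\top) H$, following its definition in the optimality proof.

```python
import numpy as np

rng = np.random.default_rng(0)
m, n_h = 10, 50                          # small dataset: n_h > 2m
C, lam1, lam2 = 1.0, 2.0**-3, 2.0**-3    # hypothetical values
H = np.tanh(rng.standard_normal((m, n_h)))
Y = rng.choice([-1.0, 1.0], size=m)

k1 = 4.0 * lam2 / (C * m) + 1.0
k2 = 4.0 * lam2 / (C * m)
k3 = lam1 / (C * m) + 1.0
h = H.T @ Y

# Eq. (7): solve an n_h x n_h system.
A = np.eye(n_h) / C + k1 * (H.T @ H) - (k2 / m) * np.outer(h, h)
beta_large = np.linalg.solve(A, k3 * h)

# Eq. (13): stack H and K into H_tilde (2m x n_h) and solve a 2m x 2m system.
P = np.eye(m) - np.outer(Y, Y) / m
K = np.sqrt(k1 - 1.0) * (P @ H)
H_tilde = np.vstack([H, K])
G = np.eye(2 * m) / C + H_tilde @ H_tilde.T
beta_small = C * k3 * h - C * k3 * H_tilde.T @ np.linalg.solve(G, H_tilde @ h)
```

Both routes give the same $\beta^*$ because $\tilde{H}^\top \tilde{H} = k_1 H^\top H - \frac{k_2}{m} h h^\top$, so Eq. (13) is Eq. (7) rewritten with the Woodbury identity.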



Speed up multi-class ELDM I

Since LDM is by nature proposed for binary classification problems, we only consider the One vs. All scheme for multi-class extensions.

Naively solving this problem forces us to repeat the binary algorithm $n_c$ times. Is there a more efficient solution?



Speed up multi-class ELDM II

Now consider the $i$th subproblem, which uses the $i$th column of $Y$, i.e. $Y^{(i)}$, as its target output. Following Theorem 1, we have

\[
\beta^{(i)*} = \Big(\frac{I}{C} + k_1 H^\top H - \frac{k_2}{m} h_i h_i^\top\Big)^{-1} (k_3 h_i)
\tag{14}
\]

where $h_i = H^\top Y^{(i)}$ and $\beta^{(i)*}$ is the $i$th column of the output layer parameter $\beta$.



Speed up multi-class ELDM III

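Given the following notations:

\begin{align}
\beta^{(i)*} &= \Big(\frac{I}{C} + k_1 H^\top H - \frac{k_2}{m} h_i h_i^\top\Big)^{-1} (k_3 h_i) \tag{15} \\
A_i &= \frac{I}{C} + k_1 H^\top H - \frac{k_2}{m} h_i h_i^\top \tag{16} \\
B_i &= k_3 h_i \tag{17} \\
A &= \frac{I}{C} + k_1 H^\top H \tag{18}
\end{align}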



Speed up multi-class ELDM IV

Then, following the Woodbury matrix identity [2], we have:

\[
A_i^{-1} = A^{-1} + \frac{\frac{k_2}{m} v_i v_i^\top}{1 - \frac{k_2}{m} h_i^\top A^{-1} h_i}
\tag{19}
\]

where $v_i = A^{-1} h_i$. Thanks to Eq. (19), we can reduce the complexity of computing the $A_i^{-1}$s from $O(n_c n_h^3 + m n_h^2)$ to $O(n_h^3 + m n_h^2)$.
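The rank-one update of Eq. (19) can be sketched as follows: $A^{-1}$ is computed once and every class reuses it. The function name `eldm_multiclass`, the +1/-1 one-vs-all target coding and the random data are assumptions of this example, not details confirmed by the slides.

```python
import numpy as np

def eldm_multiclass(H, Y, C, lam1, lam2):
    """One-vs-all ELDM with a shared inverse and a rank-one correction per class."""
    m, n_h = H.shape
    n_c = Y.shape[1]
    k1 = 4.0 * lam2 / (C * m) + 1.0
    k2 = 4.0 * lam2 / (C * m)
    k3 = lam1 / (C * m) + 1.0
    # A^{-1} from Eq. (18) is computed a single time.
    A_inv = np.linalg.inv(np.eye(n_h) / C + k1 * (H.T @ H))
    beta = np.empty((n_h, n_c))
    for i in range(n_c):
        h_i = H.T @ Y[:, i]
        v_i = A_inv @ h_i
        # Eq. (19) applied to B_i = k3 * h_i without forming A_i^{-1} explicitly.
        scale = (k2 / m) * (v_i @ h_i) / (1.0 - (k2 / m) * (h_i @ v_i))
        beta[:, i] = k3 * (1.0 + scale) * v_i
    return beta

rng = np.random.default_rng(0)
m, n_h, n_c = 60, 15, 3
H = np.tanh(rng.standard_normal((m, n_h)))
labels = rng.integers(0, n_c, size=m)
# Assumed target coding: +1 for the class of interest, -1 otherwise.
Y = np.where(labels[:, None] == np.arange(n_c)[None, :], 1.0, -1.0)
C, lam1, lam2 = 1.0, 2.0**-3, 2.0**-3

beta_fast = eldm_multiclass(H, Y, C, lam1, lam2)

# Reference: solve each binary subproblem from scratch with Eq. (14).
k1 = 4.0 * lam2 / (C * m) + 1.0
k2 = 4.0 * lam2 / (C * m)
k3 = lam1 / (C * m) + 1.0
beta_naive = np.empty((n_h, n_c))
for i in range(n_c):
    h_i = H.T @ Y[:, i]
    A_i = np.eye(n_h) / C + k1 * (H.T @ H) - (k2 / m) * np.outer(h_i, h_i)
    beta_naive[:, i] = np.linalg.solve(A_i, k3 * h_i)
```

Because $A_i^{-1} B_i$ reduces to a rescaled $v_i$, each additional class costs only matrix-vector products once $A^{-1}$ is available.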







datasets I

7 libsvm datasets and 5 biomedical datasets from Kent Ridge Bio-medical Dataset are employed to validate the effectiveness of our proposed method.

Table: Basic Information of the employed datasets

Dataset      #instance  #feature  #class
australian   690        14        2
diabetes     768        8         2
german       1000       24        2
heart        270        13        2
glass        214        9         6
svmguide2    391        20        3
wine         178        13        3



datasets II

Table: Summary of Biomedical Datasets

Dataset                 #instance  #feature  #class
DLBCL Harvard Outcome   58         6817      2
DLBCL Stanford          47         4026      2
LungCancer Ontario      39         2880      2
Leukemia ALLAML         72         7129      2
Leukemia MLL            72         12582     3



Experiment setting I

For each dataset, 5-fold cross-validation is repeated twice in search of the best performance, and the best average accuracies are recorded.

ELDM: tanh is selected as the hidden layer activation function, $C$ is selected from the range $2^{[-3:3]}$, and both $\lambda_1$ and $\lambda_2$ are chosen from the range $2^{[-8:-3]}$.
- For binary classification, $n_h$ is selected from {50, 100, 150, 200, 300}.
- For multi-class classification, $n_h$ is selected from {50, 100, 150, 200, 300, 1000}.
- For biomedical datasets, $n_h$ is selected from {100, 150, 200, 300, 500, 1000, 2000, 3000}.



Experiment setting II

ELM: the same as ELDM, except that $\lambda_1$ and $\lambda_2$ are not used.
LDM: the parameter settings for $n_h$, $C$, $\lambda_1$ and $\lambda_2$ are the same as those of ELDM. The Gaussian kernel function is employed for LDM, and the corresponding kernel parameter is chosen from $2^{[-5:5]}$.
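The ELDM search space for binary problems listed under Experiment setting I can be enumerated as below. This sketch assumes that a range such as $2^{[-3:3]}$ means powers of two with integer exponents from -3 to 3 inclusive; the slides do not state the step size, and the variable names are illustrative only.

```python
from itertools import product

# Assumed reading of 2^[a:b]: integer exponents a, a+1, ..., b.
C_grid = [2.0**e for e in range(-3, 4)]       # 2^-3 ... 2^3
lam_grid = [2.0**e for e in range(-8, -2)]    # 2^-8 ... 2^-3
nh_binary = [50, 100, 150, 200, 300]

# Every (n_h, C, lambda_1, lambda_2) combination for binary classification.
binary_grid = list(product(nh_binary, C_grid, lam_grid, lam_grid))
```

Under that reading the binary grid contains 5 x 7 x 6 x 6 = 1260 configurations, each of which would be scored by the repeated 5-fold cross-validation.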



Results for libsvm datasets I

Table: Accuracy Comparisons

Dataset      ELDM            ELM             LDM
australian   0.8658±0.0148   0.8655±0.0139   0.8580±0.0108
diabetes     0.7768±0.0262   0.7701±0.0305   0.7734±0.0212
german       0.7384±0.0137   0.734±0.0107    0.7446±0.0108
heart        0.8200±0.0239   0.8037±0.0232   0.8304±0.0303
glass        0.6961±0.0693   0.6891±0.0656   0.7024±0.0721
svmguide2    0.8439±0.0527   0.8273±0.0484   0.8413±0.0413
wine         0.9972±0.0088   0.9916±0.0135   0.9887±0.0199



Results for libsvm datasets II

Table: Efficiency Comparisons for benchmark datasets (running time)

Dataset      ELDM     ELM      LDM
australian   0.0081   0.0074   0.4447
diabetes     0.0080   0.0063   0.5368
german       0.0100   0.0088   1.7182
heart        0.0067   0.0016   0.0265
glass        0.0063   0.0021   0.3613
svmguide2    0.0050   0.0031   0.9113
wine         0.0047   0.0017   0.0913



Results for libsvm datasets III

1. First, ELDM outperforms ELM on all 7 libsvm datasets. Hence ELM benefits from margin distribution optimization.

2. Second, ELDM slightly outperforms LDM on 4 of the 7 benchmark datasets. Moreover, comparing the efficiency of the two, LDM runs much slower than ELDM on all 7 datasets.

3. As a result, ELDM is at least no worse than LDM on most of the tested datasets, with higher efficiency.



Results for biomedical datasets I

Table: Accuracy Results of the biomedical datasets

Dataset                 ELDM            ELM             LDM
DLBCL Harvard Outcome   0.6818±0.1090   0.6470±0.0803   0.6364±0.1037
DLBCL Stanford          0.9256±0.0873   0.9044±0.0783   0.9044±0.1028
LungCancer Ontario      0.7607±0.1360   0.7339±0.1462   0.7339±0.1377
Leukemia ALLAML         0.7448±0.1148   0.7219±0.1268   0.6800±0.1437
Leukemia MLL            0.8538±0.062    0.7929±0.0686   0.8757±0.0519



Results for biomedical datasets II

Figure: Parameter Analysis for DLBCL Harvard Outcome dataset (panels: accuracy versus nh, log2 C and log2 γ; curves: eldm, elm, ldm)

Figure: Parameter Analysis for DLBCL Stanford dataset (panels: accuracy versus nh, log2 C and log2 γ; curves: eldm, elm, ldm)



Results for biomedical datasets III

Figure: Parameter Analysis for LungCancer Ontario dataset (panels: accuracy versus nh, log2 C and log2 γ; curves: eldm, elm, ldm)

Figure: Parameter Analysis for Leukemia ALLAML dataset (panels: accuracy versus nh, log2 C and log2 γ; curves: eldm, elm, ldm)



Results for biomedical datasets IV

Figure: Parameter Analysis for Leukemia AML dataset (panels: accuracy versus nh, log2 C and log2 γ; curves: eldm, elm, ldm)



Results for biomedical datasets V

1. ELDM outperforms both of its prototypes on 4 of the 5 employed datasets. For Leukemia MLL, though not better than LDM, ELDM performs significantly better than ELM.

2. ELDM significantly outperforms ELM for most of the tested parameter settings.



Discussion I

1. For the small biomedical datasets, the improvements of ELDM in generalization performance are larger than those on the libsvm benchmark datasets.

2. The variances on the libsvm benchmark datasets are much smaller than those on the biomedical datasets.

3. Correspondingly, for the benchmark datasets, overfitting may not be much of an issue. In such cases, ELDM and LDM achieve similar performance.

4. However, due to the high variance and limited instances, the 5 small biomedical datasets probably suffer from overfitting. In such cases, ELDM can outperform LDM, for the following possible reasons:

Discussion II

- LDM updates its parameters iteratively, and its stopping condition is partially based on performance on the training set (the maximum number of iterations can also trigger stopping).
- When nh is small, ELM has much lower model complexity.







Future works

Future Works
- Hierarchical ELDM & Deep ELDM
- Multi category ELDM
- Efficient hyper-parameter optimization
- Efficient extension toward big data analyses







References I

[1] Wei Gao and Zhi-Hua Zhou. On the doubt about margin explanation of boosting. Artif. Intell., 203:1–18, 2013.
[2] William W. Hager. Updating the inverse of a matrix. SIAM Review, 31(2):221–239, 1989.
[3] Guang-Bin Huang. An insight into extreme learning machines: Random neurons, random features and kernels. Cognitive Computation, 6(3):376–390, 2014.
[4] Guang-Bin Huang, Lei Chen, and Chee-Kheong Siew. Universal approximation using incremental constructive feedforward networks with random hidden nodes. Neural Networks, IEEE Transactions on, 17(4):879–892, July 2006.
[5] Guang-Bin Huang, Qin-Yu Zhu, and Chee-Kheong Siew. Extreme learning machine: Theory and applications. Neurocomputing, 70(1–3):489–501, 2006.
[6] Jyrki Lötjönen, Robin Wolz, Juha Koikkalainen, Valtteri Julkunen, Lennart Thurfjell, Roger Lundqvist, Gunhild Waldemar, Hilkka Soininen, and Daniel Rueckert. Fast and robust extraction of hippocampus from MR images for diagnostics of Alzheimer's disease. NeuroImage, 56(1):185–196, 2011.
[7] Yun Niu and Yuwei Wang. Protein-protein interaction identification using a hybrid model. Artificial Intelligence in Medicine, 64(3):185–193, 2015.

References II

[8] S. C. Saxena, V. Kumar, and S. T. Hamde. Feature extraction from ECG signals using wavelet transforms for disease diagnostics. Int. J. Systems Science, 33(13):1073–1085, 2002.
[9] Harsh Shrivastava, Vijay Huddar, Sakyajit Bhattacharya, and Vaibhav Rajan. Classification with imbalance: A similarity-based method for predicting respiratory failure. In 2015 IEEE International Conference on Bioinformatics and Biomedicine, BIBM 2015, Washington, DC, USA, November 9-12, 2015, pages 707–714, 2015.
[10] Liwei Wang, Masashi Sugiyama, Zhaoxiang Jing, Cheng Yang, Zhi-Hua Zhou, Jufu Feng, and Manfred Warmuth. A refined margin analysis for boosting algorithms via equilibrium margin. Journal of Machine Learning Research, 2011.
[11] Zhisen Wei, Ke Han, Jing-Yu Yang, Hong-Bin Shen, and Dong-Jun Yu. Protein-protein interaction sites prediction by ensembling SVM and sample-weighted random forests. Neurocomputing, 193:201–212, 2016.
[12] Cuihong Wen, Jing Zhang, Ana Rebelo, and Fanyong Cheng. A directed acyclic graph-large margin distribution machine model for music symbol classification. Plos One, 11(3), 2016.

References III

[13] Teng Zhang and Zhi-Hua Zhou. Large margin distribution machine. In The 20th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, KDD '14, New York, NY, USA - August 24 - 27, 2014, pages 313–322, 2014.
[14] Yu-Hang Zhou and Zhi-Hua Zhou. Large margin distribution learning with cost interval and unlabeled data. IEEE Trans. Knowl. Data Eng., 28(7):1749–1763, 2016.



Thank you for your attention!