Efficient Ensemble Learning with Support Vector Machines

Marc Claesen 1,2, Frank De Smet 3, Johan Suykens 1,2, Bart De Moor 1,2

1 KU Leuven, ESAT, STADIUS Center for Dynamical Systems, Signal Processing and Data Analytics, Leuven, Belgium
2 iMinds Medical IT, Leuven, Belgium
3 KU Leuven, Dept. of Public Health, Leuven, Belgium
Large-scale learning with SVM

Key issues in nonlinear SVM:
- Ω(n²) training complexity (n instances)
- difficult to parallelize/distribute
- high memory requirements

EnsembleSVM remedies these problems [1]:
- trains SVM models on (small) subsets of the data
- creates an ensemble of these base models to improve generalization
- current focus on binary classification
- embarrassingly parallel (see the sketch after this list)
- reduced memory use
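To make the subset-training idea concrete, here is a minimal sketch in Python using scikit-learn and joblib rather than EnsembleSVM's own interface; the number of base models and the subset size are arbitrary illustration values, not library defaults.

```python
# Hedged sketch of the subset-training idea (not EnsembleSVM's API):
# train p independent SVMs on small random subsets, in parallel.
import numpy as np
from joblib import Parallel, delayed
from sklearn.svm import SVC

def train_base_model(X, y, subset_size, seed):
    """Fit one base SVM on a random subset of the training data."""
    rng = np.random.default_rng(seed)
    idx = rng.choice(len(X), size=subset_size, replace=False)
    return SVC(kernel="rbf", C=1.0).fit(X[idx], y[idx])

def train_ensemble(X, y, n_models=50, subset_size=500, n_jobs=-1):
    # Each base model is independent of the others, so training is
    # embarrassingly parallel and each worker only touches a small subset.
    return Parallel(n_jobs=n_jobs)(
        delayed(train_base_model)(X, y, subset_size, seed)
        for seed in range(n_models)
    )
```

Because every base model only ever sees a subset of the data, training cost and memory use per model stay small, which is what drives the speedups reported below.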
Development
- implemented in C++11, pthread-parallelized
- licensed under GNU LGPL v3+ (free software)
- portable via GNU Autotools & libtool

Benchmark results
[Figure: LIBSVM vs. EnsembleSVM (ESVM) on covtype (n = 100,000), ijcnn1 (n = 35,000), sensit (n = 78,823) and rcv1 (n = 20,242); per-dataset panels report accuracy (%), training time (s) and memory use (MB) for each tool, together with EnsembleSVM's accuracy, training time and memory use relative to LIBSVM. Across all four datasets, EnsembleSVM trains substantially faster and uses far less memory at competitive accuracy.]

Workflow
[Diagram: the training set is split into p subsets Tr(1), ..., Tr(p); a base model SVM(1), ..., SVM(p) is trained on each subset; on test data, the base-model predictions are aggregated (Σ) into the final prediction ŷ. A sketch of this aggregation step follows below.]
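To tie the diagram to code, the sketch below shows the Σ step for binary classification using majority voting, one of several possible aggregation schemes; it assumes base models produced as in the earlier sketch, with labels encoded as −1/+1, and is not EnsembleSVM's own API.

```python
# Hedged sketch of the aggregation step (majority voting), assuming binary
# base classifiers whose labels are encoded as -1 / +1.
import numpy as np

def predict_ensemble(models, X_test):
    # Stack the p base-model predictions: shape (p, n_test).
    votes = np.stack([model.predict(X_test) for model in models])
    # Majority vote via the sign of the summed votes
    # (ties map to 0 when p is even).
    return np.sign(votes.sum(axis=0))
```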
EnsembleSVM functionality

Base models: instance-weighted SVM
- support for common & precomputed kernels
- LIBSVM is used as the solver [2]

Each base model solves the instance-weighted SVM problem (illustrated in the sketch below):

\[
\begin{aligned}
\min_{\mathbf{w},\,\boldsymbol{\xi},\,\rho}\quad & \frac{1}{2}\mathbf{w}^{T}\mathbf{w} + \sum_{i=1}^{n} C_i \xi_i \\
\text{subject to}\quad & y_i\bigl(\mathbf{w}^{T}\phi(\mathbf{x}_i) + \rho\bigr) \ge 1 - \xi_i, \quad i = 1,\dots,n, \\
& \xi_i \ge 0, \quad i = 1,\dots,n.
\end{aligned}
\]

Aggregation of base model predictions
- support for common aggregation schemes
- flexible framework to prototype novel approaches
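The per-instance penalties C_i above can be approximated in other SVM tools as well; as one hedged illustration (not EnsembleSVM's interface), scikit-learn's SVC accepts a sample_weight argument that scales the penalty of each training instance.

```python
# Hedged sketch of an instance-weighted SVM: sample_weight scales the
# misclassification penalty of each instance, playing the role of the C_i terms.
import numpy as np
from sklearn.svm import SVC

def fit_weighted_svm(X, y, instance_weights, C=1.0):
    # The effective penalty for instance i is roughly C * instance_weights[i].
    model = SVC(kernel="rbf", C=C)
    model.fit(X, y, sample_weight=np.asarray(instance_weights, dtype=float))
    return model
```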
Conclusions

EnsembleSVM compared to standard SVM:
- significantly reduced training complexity
- competitive generalization performance

Future work:
- distributed implementation on Hadoop/Spark
- GPGPU implementation using CUDA/OpenCL
- native interfaces to Python, R, MATLAB, ...

References
[1] M. Claesen, F. De Smet, J. Suykens, and B. De Moor, “EnsembleSVM: A library for ensemble learning using support vector machines,” Journal of Machine Learning Research, vol. 15, pp. 141–145, 2014.
[2] C.-C. Chang and C.-J. Lin, “LIBSVM: A library for support vector machines,” ACM Transactions on Intelligent Systems and Technology, vol. 2, pp. 27:1–27:27, 2011.
Acknowledgements
- Marc Claesen is funded by IWT grant number 111065.
- Research Council KU Leuven: GOA/10/09 MaNet, KUL PFV/10/016 SymBioSys, PhD/Postdoc grants.
- Industrial Research Fund (IOF): IOF/HB/13/027 Logic Insulin.
- Flemish Government: FWO: project G.0871.12N (Neural circuits), PhD/Postdoc grants; IWT: TBM Logic Insulin (100793), TBM Rectal Cancer (100783), TBM IETA (130256), PhD/Postdoc grants; Hercules Stichting: Hercules 3: PacBio RS, Hercules 1: the C1 single-cell auto prep system, BioMark HD System and IFC controllers (Fluidigm) for single-cell analyses; iMinds Medical Information Technologies SBO 2014; VLK Stichting E. van der Schueren: rectal cancer.
- EU: ERC AdG A-DATADRIVE-B.
More information at http://esat.kuleuven.be/stadius/ensemblesvm/