Scalable Machine Learning on Spark for Multiclass Problems

Goutham Kamath1*, Sauptik Dhar2, Naveen Ramakrishnan2, David Hallac3, Jure Leskovec3, Mohak Shah2,4
1 Georgia State University, GA   2 Robert Bosch Research and Technology Center, CA   3 Stanford University, CA   4 University of Illinois at Chicago, IL
Emails: [email protected], {jure, hallac}@cs.stanford.edu, {sauptik.dhar, naveen.ramakrishnan, mohak.shah}@us.bosch.com
* Work done during Goutham Kamath's internship at Bosch.
Introduction and Motivation
‒ Big-data capability provides an advantage by harnessing sophisticated insights relevant to various real-life applications.
‒ Spark's distributed in-memory computing framework provides practitioners with a powerful, fast, scalable and easy way to build big-data ML algorithms.
‒ However, existing large-scale ML tools on Spark, such as MLlib [1] or PhotonML [2], provide limited coverage for multiclass problems and sometimes return inaccurate solutions.
‒ This work extends "ADMM based scalable machine learning on Spark" [3] to handle large-scale machine learning for multiclass problems.

Existing state-of-the-art technologies

Distributed Algorithms
First order methods: use first order gradient estimates, secant approximations, etc.
‒ Low per-iteration computation complexity,
‒ Dimension-independent convergence.
However,
‒ Slower convergence due to oscillations,
‒ Not well suited for the ill-conditioned problems typically seen in machine learning.
E.g. Parallel SGD [4], Hogwild! [5], Splash [6].

Second order methods: use additional second order (approximate) Hessian information.
‒ Capture the curvature of the objective function,
‒ Faster convergence rate than first order methods,
‒ Well suited for ill-conditioned problems seen in machine learning.
However,
‒ Do not scale favorably with dimension,
‒ High per-iteration computation complexity.
E.g. ADMM [7], DANE [8].

Randomized Algorithms: use a subsampled or low-rank representation of the original big data to solve a small-scale equivalent problem.
‒ Can be solved using traditional ML software,
‒ Avoid the need for distributed storage/analytics systems.
However,
‒ Inexact/approximate solutions that depend on the data properties.
E.g. LOCO [9].

Our Approach
We propose a generic multiclass formulation and adopt the second order Alternating Direction Method of Multipliers (ADMM) to obtain more accurate solutions without compromising on computational efficiency.

Supported Multiclass Algorithms

Methods: Multiclass Classification
‒ L1, L2, L1-L2 regularized multinomial regression
  Loss function $L(f_{w,b}(x_i), y_i)$: $\frac{1}{N}\sum_{i=1}^{N} -\log\Big(\frac{e^{x_i^\top w_{y_i}}}{\sum_{c=1}^{C} e^{x_i^\top w_c}}\Big)$
  Regularizer $R(w)$: $\sum_{c=1}^{C}\sum_{j=1}^{D} \lambda\big[\alpha\,|w_{jc}| + \frac{(1-\alpha)}{2} w_{jc}^2\big]$, with $\alpha \in [0,1]$
‒ L1, L2, L1-L2 SVM
  Loss function: $\frac{1}{N}\sum_{i=1}^{N} \max_{c}\big[(1-\delta_{y_i,c}) - x_i^\top (w_{y_i} - w_c)\big]$
  Regularizer: the same elastic-net penalty as above, with $\alpha \in [0,1]$.
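For concreteness, a minimal NumPy sketch of the first objective above (elastic-net regularized multinomial logistic regression) could look as follows; the function and variable names are illustrative and not taken from the released tool:

```python
import numpy as np

def multinomial_objective(W, X, y, lam=0.1, alpha=0.0):
    """Elastic-net regularized multinomial logistic objective.

    W     : (D, C) weight matrix, one column w_c per class
    X     : (N, D) data matrix
    y     : (N,) integer class labels in {0, ..., C-1}
    lam   : regularization strength (lambda)
    alpha : elastic-net mixing parameter in [0, 1]
            (alpha = 1 -> pure L1, alpha = 0 -> pure L2)
    """
    N = X.shape[0]
    scores = X @ W                                  # x_i^T w_c for all i, c
    # log-sum-exp with a max shift for numerical stability
    m = scores.max(axis=1, keepdims=True)
    log_norm = m[:, 0] + np.log(np.exp(scores - m).sum(axis=1))
    loss = np.mean(log_norm - scores[np.arange(N), y])
    reg = lam * (alpha * np.abs(W).sum()
                 + 0.5 * (1.0 - alpha) * (W ** 2).sum())
    return loss + reg
```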
Generic Multiclass Formulation

Given training samples $T: (x_i, y_i),\ i = 1, \ldots, N$, where $x \in \mathbb{R}^D$ and $y \in \{1, \ldots, C\}$. Here N: number of observations (samples), D: dimensions, C: total classes.

Solve
$\min_{w_1, \ldots, w_C}\ \frac{1}{N}\sum_{i=1}^{N} L(f_{w_1 \ldots w_C}(x_i), y_i) + R(w_1, \ldots, w_C)$

ADMM: introduce consensus variables $z_1, \ldots, z_C$ and solve the equivalent problem
$\min_{w_1 \ldots w_C,\ z_1 \ldots z_C}\ \frac{1}{N}\sum_{i=1}^{N} L(f_{w_1 \ldots w_C}(x_i), y_i) + R(z_1, \ldots, z_C)$
s.t. $w_c = z_c,\ \forall c \in \{1, \ldots, C\}$
where $L(\cdot)$: loss function, $R(\cdot)$: regularization.

Multinomial Logistic Regression ADMM Step (k+1-th iteration)

Here $L(f_{w_1 \ldots w_C}(x_i), y_i) = -\log\Big(\frac{e^{x_i^\top w_{y_i}}}{\sum_{c=1}^{C} e^{x_i^\top w_c}}\Big)$, $R(z_1, \ldots, z_C) = \sum_{c,j} \lambda\big[\alpha\,|z_{jc}| + \frac{(1-\alpha)}{2} z_{jc}^2\big]$, and $\rho \in (0, \infty)$ is the ADMM penalty parameter.

‒ w-step: $(w_1, \ldots, w_C)^{k+1} = \arg\min_{w_1 \ldots w_C}\ \frac{1}{N}\sum_{i=1}^{N} -\log\Big(\frac{e^{x_i^\top w_{y_i}}}{\sum_{c=1}^{C} e^{x_i^\top w_c}}\Big) + \frac{\rho}{2}\sum_{c=1}^{C} \| w_c - z_c^k + u_c^k \|_2^2$
‒ z-step: $(z_1, \ldots, z_C)^{k+1} = \arg\min_{z_1 \ldots z_C}\ \sum_{c,j} \lambda\big[\alpha\,|z_{jc}| + \frac{(1-\alpha)}{2} z_{jc}^2\big] + \frac{\rho}{2}\sum_{c=1}^{C} \| w_c^{k+1} - z_c + u_c^k \|_2^2$
‒ u-step: $u_c^{k+1} = u_c^k + w_c^{k+1} - z_c^{k+1}$

Remarks
‒ ADMM decomposes a larger problem into two (or many) smaller sub-problems in the variables w, z.
‒ The computation overhead due to big data is handled in the w-step in a distributed fashion.
‒ Many multiclass ML algorithms can be solved using a simple distributed QP solver.
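A minimal single-machine sketch of these three updates is given below; the names are illustrative, scipy's L-BFGS is used here as a stand-in for the distributed w-step solver, and the z-step uses the closed-form elastic-net soft-thresholding solution of the argmin above:

```python
import numpy as np
from scipy.optimize import minimize

def soft_threshold(V, t):
    # Element-wise proximal operator of t*|v| (soft-thresholding)
    return np.sign(V) * np.maximum(np.abs(V) - t, 0.0)

def admm_multinomial(X, y, C, lam=0.1, alpha=1.0, rho=1.0, n_iter=50):
    """Single-machine sketch of the w-, z- and u-updates for
    elastic-net multinomial logistic regression."""
    N, D = X.shape
    W, Z, U = (np.zeros((D, C)) for _ in range(3))

    def w_objective(w_flat):
        W_ = w_flat.reshape(D, C)
        scores = X @ W_
        m = scores.max(axis=1, keepdims=True)
        log_norm = m[:, 0] + np.log(np.exp(scores - m).sum(axis=1))
        nll = np.mean(log_norm - scores[np.arange(N), y])
        prox = 0.5 * rho * ((W_ - Z + U) ** 2).sum()
        return nll + prox

    for _ in range(n_iter):
        # w-step: smooth sub-problem; this is the part the tool distributes over Spark
        W = minimize(w_objective, W.ravel(), method="L-BFGS-B").x.reshape(D, C)
        # z-step: closed-form elastic-net proximal update (driver side)
        Z = soft_threshold(W + U, lam * alpha / rho) / (1.0 + lam * (1.0 - alpha) / rho)
        # u-step: scaled dual variable update
        U = U + W - Z
    return Z
```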
Experimental Results

Hadoop Configuration (Apache 1.1.1): No. of nodes = 6; No. of cores (per node) = 12 (Intel @3.20GHz); RAM size (per node) = 32 GB; Hard disk size (per node) = 500 GB.
Spark Configuration (Apache 1.6): spark.num.executors = 17; spark.executor.memory = 6 GB; spark.driver.memory = 4 GB; spark.driver.maxResultSize = 4 GB.
ADMM Configuration: adaptive ρ update [7].

Synthetic Data Set (binary classification, as used in [3])
Labels are generated as $y = \mathrm{sign}(w^\top x)$, where $w$ is group-structured: one group of coefficients equal to 1, one group equal to 0.2, and the remaining 80% of the entries zero (sparse).
Small data (~5 GB): N = 200000; Big data (~50 GB): N = 2000000; Dimension D = 100.
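A hedged sketch of how such a set could be generated is shown below; only the value pattern (1's, 0.2's, remaining 80% zeros) comes from the poster, while the group sizes and the Gaussian features are assumptions:

```python
import numpy as np

def make_synthetic(N=200000, D=100, seed=0):
    """Illustrative generator for the binary synthetic set of [3]:
    a group-structured weight vector (ones, 0.2's, remaining 80% zeros)
    and labels y = sign(w^T x). Group sizes here are assumptions."""
    rng = np.random.RandomState(seed)
    w = np.zeros(D)
    w[:10] = 1.0      # first group (assumed 10 coefficients of 1)
    w[10:20] = 0.2    # second group (assumed 10 coefficients of 0.2)
    X = rng.randn(N, D)            # standard Gaussian features (assumed)
    y = np.sign(X @ w)
    return X, y, w
```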
Table 1. Average time performance (in sec) over 10 experiments (std. deviation in parentheses).

Method                                   | ADMM             | MLlib
Data set size = 5 GB (N = 200000, D = 100)
L2-logistic regression (λ = 0.1, α = 0)  | 157.57 (0.04)    | 139.68 (2.06)
L1-logistic regression (λ = 0.1, α = 1)  | 157.05 (1.54)    | 266.9 (169.16)
Data set size = 50 GB (N = 2000000, D = 100)
L2-logistic regression (λ = 0.1, α = 0)  | 13937.3 (10.34)  | 14045.7 (411.78)
L1-logistic regression (λ = 0.1, α = 1)  | 15381.8 (5.59)   | 13155.2 (307.60)

Fig. 1. Solution convergence: relative error (y-axis, 0 to 1) vs. iteration (x-axis, 0 to 100) for MLlib, ADMM and PhotonML. [plot omitted]

‒ Fig. 1 shows the relative error $\| w_{opt} - w^* \|_2^2$ of the estimated models ($w_{opt}$) with respect to the scikit-learn [10] solution ($w^*$).
‒ Fig. 1 shows that MLlib (packaged with Spark) fails to provide optimal solutions.
‒ ADMM provides optimal solutions (see Fig. 1) without compromising on computational efficiency (Tables 1 & 2).

Real-Life Data: MNIST Handwritten Digit Recognition (~1 GB, .csv format)

Table 2. Time performance (in sec; std. deviation in parentheses).
Method                       | ADMM             | PhotonML
Binary (5 vs. 8)             | 321.54 (12.52)   | 118.81 (1.39)
Multiclass (3 vs. 5 vs. 8)   | 1229.56 (50.44)  | N/A

‒ PhotonML provides very fast and accurate algorithms, but is limited to AVRO data formats, which incurs additional overhead for data conversion.
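As a reference for how the Fig. 1 error metric can be reproduced in the binary L2 case, a small sketch follows; the scikit-learn settings, the λ-to-C mapping, and the choice of (un)normalization are assumptions rather than details taken from the poster:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

def solution_error(w_admm, X, y, lam=0.1):
    """Deviation ||w_admm - w*||_2^2 from a scikit-learn reference solution w*
    (binary L2-regularized logistic regression, as used for Fig. 1)."""
    N = X.shape[0]
    # scikit-learn minimizes 0.5*||w||^2 + C*sum(log-loss); matching the
    # (1/N)*sum(log-loss) + (lam/2)*||w||^2 objective gives C = 1/(N*lam)
    # (assumed mapping).
    ref = LogisticRegression(penalty="l2", C=1.0 / (N * lam),
                             fit_intercept=False, solver="lbfgs", max_iter=1000)
    ref.fit(X, y)
    w_star = ref.coef_.ravel()
    # squared L2 distance; Fig. 1 may additionally normalize by ||w*||^2
    return np.linalg.norm(w_admm - w_star) ** 2
```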
Discussion
‒ However, ADMM convergence is very sensitive to the ρ updates.
‒ The current ADMM solution provides the largest generic repository of efficient distributed machine learning algorithms available in Python.
Ongoing work
‒ Comparative study of the ADMM-based tool on several other multiclass algorithms.
‒ Extension of the generic framework to solve advanced non-convex learning formulations.
References:
[1] MLlib. http://spark.apache.org/MLlib/
[2] PhotonML. https://github.com/linkedin/photon-ml
[3] S. Dhar, C. Yi, N. Ramakrishnan, M. Shah, "ADMM based scalable machine learning on Spark," IEEE Big Data, 2015.
[4] M. Zinkevich et al., "Parallelized stochastic gradient descent," Advances in Neural Information Processing Systems, 2010.
[5] B. Recht et al., "Hogwild: A lock-free approach to parallelizing stochastic gradient descent," Advances in Neural Information Processing Systems, 2011.
[6] Y. Zhang, M. Jordan, "Splash: User-friendly programming interface for parallelizing stochastic algorithms." http://arxiv.org/abs/1506.07552
[7] S. Boyd et al., "Distributed optimization and statistical learning via the alternating direction method of multipliers," Foundations and Trends® in Machine Learning 3.1 (2011): 1-122.
[8] O. Shamir et al., "Communication-efficient distributed optimization using an approximate Newton-type method," ICML, 2014.
[9] B. McWilliams et al., "LOCO: Distributing ridge regression with random projections." https://arxiv.org/abs/1406.3469
[10] Scikit-learn. http://scikit-learn.org/stable/