Samsara: Declarative Machine Learning on Distributed Dataflow Systems

Sebastian Schelter (Technische Universität Berlin) [email protected]
Andrew Palumbo [email protected]
Shannon Quinn (University of Georgia) [email protected]
Suneel Marthi [email protected]
Andrew Musselman [email protected]

Machine Learning Systems Workshop at NIPS 2016

Overview

Motivation:
● apply ML to large datasets stored in distributed filesystems, using dataflow engines like Spark

Problem: dataflow engines are hard to program
● programs consist of a sequence of parallelizable second-order functions on partitioned data
● mismatch for ML applications, which mostly operate on a tensor type

Samsara: a domain-specific language and execution layer for declarative machine learning
● allows advanced users to rapidly create new algorithms and adapt existing ones
● reduces the need for knowledge of the programming and execution model of the underlying systems

Architectural Overview
● programs written in a declarative Scala DSL
● compilation to a DAG of logical operators
● translation to backend-specific physical operators
● execution on the dataflow backend

[Architecture figure: application code is written against the Scala DSL and in-core algebra of Apache Mahout Samsara; programs are compiled to a logical DAG, translated to a physical DAG, and handed to a backend-specific physical translation layer that executes on the distributed dataflow systems Apache Spark, Apache Flink, or H2O.]
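A minimal sketch of this programming model (the helper name and the import paths are assumptions based on the Mahout Samsara bindings; backend setup is omitted). Expressions on a distributed row matrix (DRM) only assemble the logical DAG; optimization, translation to physical operators and execution on the backend happen once an action such as collect materializes the result as an in-core matrix at the driver:

import org.apache.mahout.math.Matrix
import org.apache.mahout.math.drm._
import org.apache.mahout.math.drm.RLikeDrmOps._

def gram(drmX: DrmLike[Int]): Matrix = {
  val drmXtX = drmX.t %*% drmX  // builds logical operators only, nothing runs yet
  drmXtX.collect                // optimize, translate, execute, fetch in-core
}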

Compilation of Programs to DAGs of Logical Operators

Example: Distributed Ridge Regression
● assumption: 'tall-and-skinny' input matrix X
● distributed computation of XᵀX and XᵀY
● solve the normal equation (XᵀX + λI)⁻¹ XᵀY locally afterwards

def dridge(data: DrmLike[Int], lambda: Double): Matrix = {
  // slice out features, add column for bias term
  val drmX = data(::, 0 until data.ncol) cbind 1
  val drmY = data(::, data.ncol)

  // distributed matrix multiplications
  val drmXtX = drmX.t %*% drmX
  val drmXtY = drmX.t %*% drmY

  // fetch results into driver memory
  val XtX = drmXtX.collect
  val XtY = drmXtY.collect

  XtX.diagv += lambda  // add regularization
  solve(XtX, XtY)      // compute parameters in-core on driver
}

● logical optimizer rewrites and merges operators to reduce the number of passes over the data (a toy sketch of such a rewrite follows below)

[Figure: resulting DAG of logical operators for the program above (OpAt, OpAB, OpAtB, OpCBindScalar and OpMapBlock over the Input, with materialization barriers at the driver), and the rewritten DAG after logical optimization, in which transpose and multiplication are merged into OpAtA and OpAtB.]
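To make the rewrite step concrete, here is a toy sketch (not Mahout's actual optimizer code; the case classes and the rewrite function are illustrative, only the operator names mirror the DAG above) of a rule that fuses a transpose followed by a multiplication, so that X.t %*% X becomes a single OpAtA pass over the data:

sealed trait LogicalOp
case class Input(name: String) extends LogicalOp
case class OpAt(a: LogicalOp) extends LogicalOp
case class OpAB(a: LogicalOp, b: LogicalOp) extends LogicalOp
case class OpAtA(a: LogicalOp) extends LogicalOp
case class OpAtB(a: LogicalOp, b: LogicalOp) extends LogicalOp

// recursive rewrite: fuse OpAB(OpAt(a), b) into OpAtA / OpAtB
def rewrite(op: LogicalOp): LogicalOp = op match {
  case OpAB(OpAt(a), b) if a == b => OpAtA(rewrite(a))
  case OpAB(OpAt(a), b)           => OpAtB(rewrite(a), rewrite(b))
  case OpAB(a, b)                 => OpAB(rewrite(a), rewrite(b))
  case OpAt(a)                    => OpAt(rewrite(a))
  case leaf                       => leaf
}

// e.g. rewrite(OpAB(OpAt(Input("X")), Input("X"))) == OpAtA(Input("X"))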

Choice of Physical Operators

● system chooses the physical execution strategy for operators on a given backend based on three characteristics of the operands:
  (1) special structure (e.g., diagonal matrices)
  (2) dimensions (e.g., 'tall-and-skinny')
  (3) partitioning (e.g., use local instead of distributed joins for co-partitioned inputs)

Example: Transpose-Times-Self on Spark
● computation of XᵀX in our regression code on Apache Spark (represented by OpAtA)

[Figure, Variant A: distributed computation of XᵀX via summation of partial outer products. Each worker emits the outer products of the rows in its partition of X, and the partial results are summed per output partition of XᵀX.]

[Figure, Variant B: distributed computation of XᵀX via summation of local Gram matrices. Each worker computes the Gram matrix XᵢᵀXᵢ of its partition Xᵢ, and the partial Gram matrices are summed on the master.]

● specialized physical operator for 'tall-and-skinny' matrices
● applies local pre-aggregation in the workers
● assumes that the upper half of the local Gram matrices fits into the memory of a worker process
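As an illustration of Variant B, the following is a minimal sketch written directly against Spark's RDD API, assuming rows are stored as dense arrays of length k; the function name is hypothetical, and the real OpAtA operator works on Mahout's block-wise matrix representation and only materializes the upper triangle of each local Gram matrix:

import org.apache.spark.rdd.RDD

// Variant B sketch: each partition pre-aggregates its local Gram matrix,
// then the small k x k partial results are summed into X^T X.
def gramMatrix(rows: RDD[Array[Double]], k: Int): Array[Array[Double]] = {
  rows
    .mapPartitions { it =>
      val partial = Array.ofDim[Double](k, k)  // local Gram matrix of this partition
      it.foreach { row =>
        var i = 0
        while (i < k) {
          var j = 0
          while (j < k) {
            partial(i)(j) += row(i) * row(j)
            j += 1
          }
          i += 1
        }
      }
      Iterator.single(partial)
    }
    .reduce { (a, b) =>                        // sum the partial Gram matrices
      for (i <- 0 until k; j <- 0 until k) a(i)(j) += b(i)(j)
      a
    }
}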

Evaluation, Limitations, Future Work

Benefit of Proposed Optimizations
● compared the standard execution mode of Samsara with a variant that has the optimizations disabled
● cluster of 24 machines, synthetically generated matrices with 20 columns and a growing number of rows
● 5x performance improvement due to the choice of the specialized operator for XᵀX, and 3x due to the choice of local vs. distributed joins for XᵀY

[Two plots: runtime (ms) vs. number of entries of the input matrix (millions), for the computation of XᵀX and XᵀY, each comparing optimized with non-optimized execution.]

Limitations & Future Work
● lack of speed for in-core operations due to JVM-based matrix libraries
  → currently exploring integration of ViennaCL for selected operators to provide a bridge to native performance on many-core architectures
● high variance in performance between different backends, e.g., due to lack of efficient caching of intermediate results in Apache Flink


Acknowledgements
● Samsara is the result of a community effort from the Apache Mahout project, with fundamental contributions to the design and codebase by Dmitriy Lyubimov



References
● Lyubimov and Palumbo. "Apache Mahout: Beyond MapReduce." (2016)
● "Mahout Scala and Spark Bindings", http://mahout.apache.org/users/sparkbindings/ScalaSparkBindings.pdf

https://mahout.apache.org
