Samsara: Declarative Machine Learning on Distributed Dataflow Systems

Sebastian Schelter (Technische Universität Berlin) [email protected]
Andrew Palumbo [email protected]
Shannon Quinn (University of Georgia) [email protected]
Suneel Marthi [email protected]
Andrew Musselman [email protected]

Machine Learning Systems Workshop at NIPS 2016

Overview

Motivation:
● apply ML to large datasets stored in distributed filesystems, using dataflow engines like Spark

Problem: dataflow engines are hard to program
● programs consist of a sequence of parallelizable second-order functions on partitioned data
● mismatch for ML applications, which mostly operate on a tensor type

Samsara: a domain-specific language and execution layer for declarative machine learning
● allows advanced users to rapidly create new algorithms and adapt existing ones
● reduces the need for knowledge of the programming and execution model of the underlying systems

Architectural Overview
● programs written in a declarative Scala DSL
● compilation to a DAG of logical operators
● translation to backend-specific physical operators
● execution on the dataflow backend

[Architecture figure: application code is written against the Scala DSL and in-core algebra of Apache Mahout Samsara; programs are compiled to a logical DAG, translated to a physical DAG, and handed to a backend-specific physical translation layer that executes on the distributed dataflow systems Apache Spark, Apache Flink, or H2O.]
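A minimal sketch of this programming model (the helper name and the import paths are assumptions based on the Mahout Samsara bindings; backend setup is omitted). Expressions on a distributed row matrix (DRM) only assemble the logical DAG; optimization, translation to physical operators and execution on the backend happen once an action such as collect materializes the result as an in-core matrix at the driver:

import org.apache.mahout.math.Matrix
import org.apache.mahout.math.drm._
import org.apache.mahout.math.drm.RLikeDrmOps._

def gram(drmX: DrmLike[Int]): Matrix = {
  val drmXtX = drmX.t %*% drmX  // builds logical operators only, nothing runs yet
  drmXtX.collect                // optimize, translate, execute, fetch in-core
}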

Compilation of Programs to DAGs of Logical Operators

Example: Distributed Ridge Regression
● assumption: 'tall-and-skinny' input matrix X
● distributed computation of XᵀX and XᵀY
● solve the normal equation (XᵀX + λI)⁻¹ XᵀY locally afterwards

def dridge(data: DrmLike[Int], lambda: Double): Matrix = {
  // slice out features, add column for bias term
  val drmX = data(::, 0 until data.ncol) cbind 1
  val drmY = data(::, data.ncol)

  // distributed matrix multiplications
  val drmXtX = drmX.t %*% drmX
  val drmXtY = drmX.t %*% drmY

  // fetch results into driver memory
  val XtX = drmXtX.collect
  val XtY = drmXtY.collect

  XtX.diagv += lambda  // add regularization
  solve(XtX, XtY)      // compute parameters in-core on driver
}

● logical optimizer rewrites and merges operators to reduce the number of passes over the data (a toy sketch of such a rewrite follows below)

[Figure: resulting DAG of logical operators for the program above (OpAt, OpAB, OpAtB, OpCBindScalar and OpMapBlock over the Input, with materialization barriers at the driver), and the rewritten DAG after logical optimization, in which transpose and multiplication are merged into OpAtA and OpAtB.]
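To make the rewrite step concrete, here is a toy sketch (not Mahout's actual optimizer code; the case classes and the rewrite function are illustrative, only the operator names mirror the DAG above) of a rule that fuses a transpose followed by a multiplication, so that X.t %*% X becomes a single OpAtA pass over the data:

sealed trait LogicalOp
case class Input(name: String) extends LogicalOp
case class OpAt(a: LogicalOp) extends LogicalOp
case class OpAB(a: LogicalOp, b: LogicalOp) extends LogicalOp
case class OpAtA(a: LogicalOp) extends LogicalOp
case class OpAtB(a: LogicalOp, b: LogicalOp) extends LogicalOp

// recursive rewrite: fuse OpAB(OpAt(a), b) into OpAtA / OpAtB
def rewrite(op: LogicalOp): LogicalOp = op match {
  case OpAB(OpAt(a), b) if a == b => OpAtA(rewrite(a))
  case OpAB(OpAt(a), b)           => OpAtB(rewrite(a), rewrite(b))
  case OpAB(a, b)                 => OpAB(rewrite(a), rewrite(b))
  case OpAt(a)                    => OpAt(rewrite(a))
  case leaf                       => leaf
}

// e.g. rewrite(OpAB(OpAt(Input("X")), Input("X"))) == OpAtA(Input("X"))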

Choice of Physical Operators

● system chooses the physical execution strategy for operators on a given backend based on three characteristics of the operands:
  (1) special structure (e.g., diagonal matrices)
  (2) dimensions (e.g., 'tall-and-skinny')
  (3) partitioning (e.g., use local instead of distributed joins for co-partitioned inputs)

Example: Transpose-Times-Self on Spark
● computation of XᵀX in our regression code on Apache Spark (represented by OpAtA)

[Figure, Variant A: distributed computation of XᵀX via summation of partial outer products. Each worker emits the outer products of the rows in its partition of X, and the partial results are summed per output partition of XᵀX.]

[Figure, Variant B: distributed computation of XᵀX via summation of local Gram matrices. Each worker computes the Gram matrix XᵢᵀXᵢ of its partition Xᵢ, and the partial Gram matrices are summed on the master.]

● specialized physical operator for 'tall-and-skinny' matrices
● applies local pre-aggregation in the workers
● assumes that the upper half of the local Gram matrices fits into the memory of a worker process
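As an illustration of Variant B, the following is a minimal sketch written directly against Spark's RDD API, assuming rows are stored as dense arrays of length k; the function name is hypothetical, and the real OpAtA operator works on Mahout's block-wise matrix representation and only materializes the upper triangle of each local Gram matrix:

import org.apache.spark.rdd.RDD

// Variant B sketch: each partition pre-aggregates its local Gram matrix,
// then the small k x k partial results are summed into X^T X.
def gramMatrix(rows: RDD[Array[Double]], k: Int): Array[Array[Double]] = {
  rows
    .mapPartitions { it =>
      val partial = Array.ofDim[Double](k, k)  // local Gram matrix of this partition
      it.foreach { row =>
        var i = 0
        while (i < k) {
          var j = 0
          while (j < k) {
            partial(i)(j) += row(i) * row(j)
            j += 1
          }
          i += 1
        }
      }
      Iterator.single(partial)
    }
    .reduce { (a, b) =>                        // sum the partial Gram matrices
      for (i <- 0 until k; j <- 0 until k) a(i)(j) += b(i)(j)
      a
    }
}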

Evaluation, Limitations, Future Work

Benefit of Proposed Optimizations
● compared the standard execution mode of Samsara with a variant that has the optimizations disabled
● cluster of 24 machines, synthetically generated matrices with 20 columns and a growing number of rows
● 5x performance improvement due to the choice of the specialized operator for XᵀX, and 3x due to the choice of local vs. distributed joins for XᵀY

[Two plots: runtime (ms) vs. number of entries of the input matrix (millions), for the computation of XᵀX and XᵀY, each comparing optimized with non-optimized execution.]

Limitations & Future Work
● lack of speed for in-core operations due to JVM-based matrix libraries
  → currently exploring integration of ViennaCL for selected operators to provide a bridge to native performance on many-core architectures
● high variance in performance between different backends, e.g., due to lack of efficient caching of intermediate results in Apache Flink


Acknowledgements
● Samsara is the result of a community effort from the Apache Mahout project, with fundamental contributions to the design and codebase by Dmitriy Lyubimov



References
● Lyubimov and Palumbo. "Apache Mahout: Beyond MapReduce." (2016)
● "Mahout Scala and Spark Bindings", http://mahout.apache.org/users/sparkbindings/ScalaSparkBindings.pdf

https://mahout.apache.org
