"Apache Mahout: Beyond MapReduce." (2016). â "Mahout Scala and Spark Bindings" http://mahout.apache.org/users/sparkbin
Samsara: Declarative Machine Learning on Distributed Dataflow Systems

Sebastian Schelter (Technische Universität Berlin), Andrew Palumbo, Shannon Quinn (University of Georgia), Suneel Marthi, Andrew Musselman
Overview

Motivation:
● apply ML to large datasets stored in distributed filesystems, using dataflow engines like Spark

Problem: dataflow engines are hard to program
● programs consist of a sequence of parallelizable second-order functions on partitioned data
● this is a mismatch for ML applications, which mostly operate on a tensor type

Apache Mahout Samsara:
● domain-specific language and execution layer for declarative machine learning
● allows advanced users to rapidly create new algorithms and adapt existing ones
● reduces the need for knowledge of the programming and execution model of the underlying systems

Architectural Overview:
● programs are written in a declarative Scala DSL (in-core algebra plus distributed operators)
● compilation to a DAG of logical operators
● translation to backend-specific physical operators
● execution on a dataflow backend

[Architecture diagram: application code written in the Scala DSL / in-core algebra is compiled by Samsara into a logical DAG and then a physical DAG, which is handed to a physical translation layer for Spark, Flink, or H2O and executed on the corresponding distributed dataflow system (Apache Spark, Apache Flink, H2O)]
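For illustration, a minimal sketch of what application code looks like in the DSL. It assumes a Samsara environment with a distributed context already in scope (e.g., the Mahout Spark shell); the imports and the drmParallelize helper follow the Samsara Scala bindings, so treat the exact paths and signatures as an assumption rather than a definitive listing.

import org.apache.mahout.math._
import org.apache.mahout.math.scalabindings._
import org.apache.mahout.math.scalabindings.RLikeOps._
import org.apache.mahout.math.drm._
import org.apache.mahout.math.drm.RLikeDrmOps._

// small in-core matrix built with the R-like dense() helper
val a = dense((1.0, 2.0), (3.0, 4.0), (5.0, 6.0))

// distribute it as a DRM (distributed row matrix); the user never writes
// second-order functions, the optimizer picks physical operators per backend
val drmA = drmParallelize(a, numPartitions = 2)

// declarative matrix algebra: compute A^T A and materialize it in the driver
val gram = (drmA.t %*% drmA).collect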
Compilation of Programs to DAGs of Logical Operators

Example: Distributed Ridge Regression
● assumption: ‘tall-and-skinny’ input matrix X
● distributed computation of XᵀX and XᵀY
● solve the normal equation (XᵀX + λI)⁻¹ XᵀY locally afterwards

def dridge(data: DrmLike[Int], lambda: Double): Matrix = {
  // slice out features, add column for bias term
  val drmX = data(::, 0 until data.ncol - 1) cbind 1
  val drmY = data(::, data.ncol - 1 until data.ncol)

  // distributed matrix multiplications
  val drmXtX = drmX.t %*% drmX
  val drmXtY = drmX.t %*% drmY

  // fetch results into driver memory
  val XtX = drmXtX.collect
  val XtY = drmXtY.collect

  XtX.diagv += lambda   // add regularization
  solve(XtX, XtY)       // compute parameters in-core on the driver
}

Resulting DAG of Logical Operators:
[Figure: logical DAG built from the code above, composed of Input, OpMapBlock, OpCBindScalar, OpAt and OpAB operators, with a materialization barrier at the driver for each collect]

● the logical optimizer rewrites and merges operators to reduce the number of passes over the data

Rewritten DAG after Logical Optimization:
[Figure: rewritten DAG in which transpose and multiplication are merged into OpAtA (for XᵀX) and OpAtB (for XᵀY) over the shared OpCBindScalar/OpMapBlock input pipeline]
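To make the rewriting step concrete, here is a small, self-contained Scala sketch of the kind of pattern-based rewrite the logical optimizer performs. The case classes are a hypothetical toy representation, not the actual Mahout operator classes: a transpose followed by a multiplication is fused into a single transpose-times-self (or transpose-times-matrix) operator, saving one pass over the data.

// toy logical operator tree (names mirror the poster, structure is illustrative)
sealed trait LogicalOp
case class Input(name: String) extends LogicalOp
case class OpAt(a: LogicalOp) extends LogicalOp                 // A^T
case class OpAB(a: LogicalOp, b: LogicalOp) extends LogicalOp   // A %*% B
case class OpAtA(a: LogicalOp) extends LogicalOp                // fused A^T %*% A
case class OpAtB(a: LogicalOp, b: LogicalOp) extends LogicalOp  // fused A^T %*% B

// one rewrite pass: detect the A^T %*% A and A^T %*% B patterns bottom-up
def rewrite(op: LogicalOp): LogicalOp = op match {
  case OpAB(OpAt(a), b) if a == b => OpAtA(rewrite(a))
  case OpAB(OpAt(a), b)           => OpAtB(rewrite(a), rewrite(b))
  case OpAB(a, b)                 => OpAB(rewrite(a), rewrite(b))
  case OpAt(a)                    => OpAt(rewrite(a))
  case other                      => other
}

// drmX.t %*% drmX corresponds to OpAB(OpAt(X), X) and is fused into OpAtA(X)
val x = Input("X")
println(rewrite(OpAB(OpAt(x), x)))   // prints OpAtA(Input(X))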
Choice of Physical Operators

Example: Transpose-Times-Self on Spark
● the system chooses a physical execution strategy for operators on a given backend based on three characteristics of the operands:
  (1) special structure (e.g., diagonal matrices)
  (2) dimensions (e.g., ‘tall-and-skinny’)
  (3) partitioning (e.g., use local instead of distributed joins for copartitioned inputs)
● example: computation of XᵀX in our regression code on Apache Spark (represented by OpAtA)

Variant A: Distributed Computation of XᵀX via Summation of Partial Outer Products
[Figure: worked example on a small matrix — each worker holds a horizontal partition of X, emits the outer products of its rows, and the partial results are summed per output partition to form (XᵀX)₁ and (XᵀX)₂]

Variant B: Distributed Computation of XᵀX via Summation of Local Gram Matrices
● specialized physical operator for ‘tall-and-skinny’ matrices that applies local pre-aggregation in the workers
● assumes that the upper half of the local Gram matrices fits into the memory of a worker process
[Figure: worked example on the same matrix — each worker computes the upper half of the local Gram matrix XᵢᵀXᵢ of its partition Xᵢ, and the master sums these to obtain XᵀX]
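A simplified, self-contained Scala sketch of the pre-aggregation idea behind Variant B (no Spark dependency, partitions simulated as local arrays, and the full Gram matrix computed instead of only its upper half): each "worker" computes the d x d Gram matrix of its own slice of X, so only d x d partial results need to be shipped and summed, independent of the number of rows.

// Gram matrix Xp^T Xp of one horizontal slice of X (d x d, independent of row count)
def localGram(partition: Array[Array[Double]], d: Int): Array[Array[Double]] = {
  val g = Array.fill(d, d)(0.0)
  for (row <- partition; i <- 0 until d; j <- 0 until d)
    g(i)(j) += row(i) * row(j)
  g
}

// element-wise sum of two d x d matrices (the aggregation done on the master)
def add(a: Array[Array[Double]], b: Array[Array[Double]]): Array[Array[Double]] =
  a.zip(b).map { case (ra, rb) => ra.zip(rb).map { case (x, y) => x + y } }

// two partitions of a 'tall-and-skinny' X with d = 2 columns
val partitions = Seq(
  Array(Array(1.0, 2.0), Array(3.0, 4.0)),
  Array(Array(5.0, 6.0), Array(7.0, 8.0))
)

// workers pre-aggregate locally, the master only sums small d x d matrices
val xtx = partitions.map(p => localGram(p, 2)).reduce(add)
xtx.foreach(row => println(row.mkString(" ")))

In contrast, Variant A emits an outer product per row and aggregates those per output partition, which shuffles more intermediate data for tall-and-skinny inputs.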
Evaluation, Limitations, Future Work

Benefit of Proposed Optimizations
● compared the standard execution mode of Samsara with a variant that has the optimizations disabled
● cluster of 24 machines, synthetically generated matrices with 20 columns and a growing number of rows
● 5x performance improvement due to the choice of the specialized operator for XᵀX, and 3x due to the choice of local vs. distributed joins for XᵀY

[Figure: two plots of runtime (ms) vs. number of entries of the input matrix (millions), comparing optimized and non-optimized execution for XᵀX and for XᵀY]

Limitations & Future Work
● lack of speed for in-core operations due to JVM-based matrix libraries
  → currently exploring integration of ViennaCL for selected operators to provide a bridge to native performance on many-core architectures
● high variance in performance between different backends, e.g., due to the lack of efficient caching of intermediate results in Apache Flink
Acknowledgements
● Samsara is the result of a community effort within the Apache Mahout project, with fundamental contributions to the design and codebase by Dmitriy Lyubimov

References
● Lyubimov and Palumbo. "Apache Mahout: Beyond MapReduce." (2016)
● "Mahout Scala and Spark Bindings", http://mahout.apache.org/users/sparkbindings/ScalaSparkBindings.pdf
● https://mahout.apache.org

Machine Learning Systems Workshop at NIPS 2016