Apache Mahout's new DSL for Distributed Machine Learning
Sebastian Schelter, GOTO Berlin, 11/06/2014
Overview
• Apache Mahout: Past & Future
• A DSL for Machine Learning
• Example
• Under the covers
• Distributed computation of XᵀX
Apache Mahout: History
• library for scalable machine learning (ML)
• started six years ago as ML on MapReduce
• focus on popular ML problems and algorithms
  – Collaborative Filtering: "find interesting items for users based on past behavior"
  – Classification: "learn to categorize objects"
  – Clustering: "find groups of similar objects"
  – Dimensionality Reduction: "find a low-dimensional representation of the data"
• large user base (e.g. Adobe, AOL, Accenture, Foursquare, Mendeley, ResearchGate, Twitter)
Background: MapReduce
• simple paradigm for distributed processing (proposed by Google)
• the user implements two functions, map and reduce
• the system executes the program in parallel and scales to clusters with thousands of machines
• popular open source implementation: Apache Hadoop

[Figure: MapReduce data flow: map, shuffle, reduce]
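As an illustration (not from the slides), here is a minimal word count in map/reduce style on plain Scala collections, a stand-in for what Hadoop would execute distributed across a cluster; the function names are our own:

    // word count in map/reduce style: the mapper emits (word, 1) pairs,
    // the framework groups them by key, and the reducer sums the counts
    def mapper(doc: String): Seq[(String, Int)] =
      doc.split("\\s+").toSeq.map(word => (word, 1))

    def reducer(word: String, counts: Seq[Int]): (String, Int) =
      (word, counts.sum)

    val docs = Seq("the quick brown fox", "the lazy dog")
    val counts = docs
      .flatMap(mapper)                     // map phase
      .groupBy { case (word, _) => word }  // shuffle/group phase
      .map { case (word, pairs) => reducer(word, pairs.map(_._2)) }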
Apache Mahout: Problems
• MapReduce is not well suited for ML
  – slow execution, especially for iterations
  – the constrained programming model makes code hard to write, read and adjust
  – lack of declarativity
  – lots of hand-coded joins necessary
• → abandonment of MapReduce
  – new MapReduce implementations will be rejected
  – widely used "legacy" implementations will be maintained
• → "reboot" with a new DSL
Overview
• Apache Mahout: Past & Future
• A DSL for Machine Learning
• Example
• Under the covers
• Distributed computation of XᵀX
Requirements for an ideal ML environment
1. R/Matlab-like semantics
   – a type system that covers linear algebra and statistics
2. Modern programming language qualities
   – functional programming
   – object-oriented programming
   – scriptable and interactive
3. Scalability
   – automatic distribution and parallelization with sensible performance
Scala DSL
• Scala as programming/scripting environment
• R-like DSL:

    G = B Bᵀ − C − Cᵀ + (ξᵀξ) (s_q s_qᵀ)

    val G = B %*% B.t - C - C.t + (ksi dot ksi) * (s_q cross s_q)

• Declarativity!
• Algebraic expression optimizer for distributed linear algebra
  – provides a translation layer to distributed engines
  – currently supports Apache Spark only
  – might support Apache Flink in the future
Data Types
• Scalar real values

    val x = 2.367

• In-memory vectors: dense, and two types of sparse

    val v = dvec(1, 0, 5)
    val w = svec((0 -> 1) :: (2 -> 5) :: Nil)

• In-memory matrices: sparse and dense, plus a number of specialized matrices

    val A = dense((1, 0, 5), (2, 1, 4), (4, 3, 1))

• Distributed Row Matrices (DRM)
  – huge matrix, partitioned by rows
  – lives in the main memory of the cluster
  – provides a small set of parallelized operations
  – lazily evaluated operation execution

    val drmA = drmFromHDFS(...)
Features (1)
• Matrix, vector, scalar operators: in-memory and out-of-core

    drmA %*% drmB
    A %*% x
    A.t %*% drmB
    A * B

• Slicing operators

    A(5 until 20, 3 until 40)
    A(5, ::); A(5, 5); x(a to b)

• Assignments (in-memory only)

    A(5, ::) := x
    A *= B
    A -=: B; 1 /:= x

• Vector-specific operators

    x dot y; x cross y

• Summaries

    A.nrow; x.length; A.colSums; B.rowMeans
    x.sum; A.norm
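A small hedged usage sketch combining the operators above on the in-core types (the values are invented for illustration):

    val A = dense((1, 2), (3, 4))  // 2x2 in-memory matrix
    val B = A.t %*% A              // in-memory matrix multiplication
    A(0, ::) := dvec(5, 6)         // overwrite the first row
    val cs = A.colSums             // per-column sums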
Features (2)
• Solving linear systems

    val x = solve(A, b)

• In-memory decompositions

    val (inMemQ, inMemR) = qr(inMemM)
    val ch = chol(inMemM)
    val (inMemV, d) = eigen(inMemM)
    val (inMemU, inMemV, s) = svd(inMemM)

• Out-of-core decompositions

    val (drmQ, inMemR) = thinQR(drmA)
    val (drmU, drmV, s) = dssvd(drmA, k = 50, q = 1)

• Caching of DRMs

    val drmA_cached = drmA.checkpoint()
    drmA_cached.uncache()
Overview
• Apache Mahout: Past & Future
• A DSL for Machine Learning
• Example
• Under the covers
• Distributed computation of XᵀX
Cereals

Name                     protein  fat  carbo  sugars  rating
Apple Cinnamon Cheerios  2        2    10.5   10      29.509541
Cap'n'Crunch             1        2    12     12      18.042851
Cocoa Puffs              1        1    12     13      22.736446
Froot Loops              2        1    11     13      32.207582
Honey Graham Ohs         1        2    12     11      21.871292
Wheaties Honey Gold      2        1    16     8       36.187559
Cheerios                 6        2    17     1       50.764999
Clusters                 3        2    13     7       40.400208
Great Grains Pecan       3        3    13     4       45.811716

Source: http://lib.stat.cmu.edu/DASL/Datafiles/Cereals.html
Linear Regression
• Assumption: the target variable y is generated by a linear combination of the feature matrix X with the parameter vector β, plus noise ε:

    y = Xβ + ε

• Goal: find an estimate of the parameter vector β that explains the data well
• Cereals example: X = weights of ingredients, y = customer rating
Data Ingestion
• Usually: load the dataset as a DRM from a distributed filesystem:

    val drmData = drmFromHDFS(...)

• 'Mimic' a large dataset for our example:

    val drmData = drmParallelize(dense(
      (2, 2, 10.5, 10, 29.509541),  // Apple Cinnamon Cheerios
      (1, 2, 12,   12, 18.042851),  // Cap'n'Crunch
      (1, 1, 12,   13, 22.736446),  // Cocoa Puffs
      (2, 1, 11,   13, 32.207582),  // Froot Loops
      (1, 2, 12,   11, 21.871292),  // Honey Graham Ohs
      (2, 1, 16,    8, 36.187559),  // Wheaties Honey Gold
      (6, 2, 17,    1, 50.764999),  // Cheerios
      (3, 2, 13,    7, 40.400208),  // Clusters
      (3, 3, 13,    4, 45.811716)), // Great Grains Pecan
      numPartitions = 2)
Data Preparation
• Cereals example: the target variable y is the customer rating; the weights of the ingredients are the features X
• extract X as a DRM by slicing, fetch y as an in-core vector:

    val drmX = drmData(::, 0 until 4)
    val y = drmData.collect(::, 4)

[Figure: drmX holds columns 0 to 3 (protein, fat, carbo, sugars); y holds column 4 (rating)]
Estimating β
• Ordinary Least Squares: minimizes the sum of squared residuals between the true target variable and the predicted target variable
• Closed-form expression for the estimate of β:

    β̂ = (XᵀX)⁻¹ Xᵀy

• Computing XᵀX and Xᵀy is as simple as typing the formulas:

    val drmXtX = drmX.t %*% drmX
    val drmXty = drmX.t %*% y
Estimating β
• Solve the following linear system to get the least-squares estimate of β:

    XᵀX β̂ = Xᵀy

• Fetch XᵀX and Xᵀy onto the driver and use an in-core solver
  – assumes XᵀX fits into memory
  – uses an analogue of R's solve() function

    val XtX = drmXtX.collect
    val Xty = drmXty.collect(::, 0)
    val betaHat = solve(XtX, Xty)
→ We have implemented distributed linear regression! (a real implementation would also need to add a bias term)
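As a hedged follow-up (not shown in the talk), the estimate can be used for predictions with the same operators; whether %*% with an in-core vector yields a single-column DRM, as assumed here, depends on the DSL version:

    val yHat = (drmX %*% betaHat).collect(::, 0)      // distributed predictions, fetched to the driver
    val mse = ((y - yHat) dot (y - yHat)) / y.length  // mean squared training error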
Overview
• Apache Mahout: Past & Future
• A DSL for Machine Learning
• Example
• Under the covers
• Distributed computation of XᵀX
Underlying systems
• currently: prototype on Apache Spark
  – fast and expressive cluster computing system
  – general computation graphs, in-memory primitives, rich API, interactive shell
• future: add Apache Flink
  – database-inspired distributed processing engine
  – emerged from research by TU Berlin, HU Berlin and HPI
  – functionality similar to Apache Spark, adds data flow optimization and efficient out-of-core execution
Runtime & Optimization
• Execution is deferred; the user composes logical operators:

    val C = X.t %*% X
    C.writeDrm(path)
• Computational actions implicitly trigger optimization (= selection of a physical plan) and execution:

    val inMemV = (U %*% M).collect

• Optimization factors: size of operands, orientation of operands, partitioning, sharing of computational paths
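Tying this back to the regression example: composing operators only builds a logical plan, and the collect action is what triggers optimization and execution of the whole pipeline.

    val drmXtX = drmX.t %*% drmX  // deferred: builds a logical plan only
    val XtX = drmXtX.collect      // action: optimize, execute, fetch the result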
Optimization Example
• Computation of XᵀX in the example:

    val drmXtX = drmX.t %*% drmX

• Naïve execution
  – 1st pass: transpose X (requires repartitioning of X)
  – 2nd pass: multiply the result with X (expensive, potentially requires repartitioning again)
• Logical optimization: the optimizer rewrites the plan to use a specialized logical operator for Transpose-Times-Self matrix multiplication

[Figure: the logical plan MatrixMult(Transpose(X), X) is rewritten to a single Transpose-Times-Self operator over X]
Transpose-Times-Self
• Mahout computes XᵀX via a row-outer-product formulation
  – executes in a single pass over the row-partitioned X

    XᵀX = Σᵢ xᵢ xᵢᵀ  (sum over the m rows of X)

[Figure: XᵀX built up as the running sum of outer products x₁•x₁•ᵀ + x₂•x₂•ᵀ + x₃•x₃•ᵀ + x₄•x₄•ᵀ]
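A hedged sketch of this formulation on a single partition, written with the DSL's in-core operators (rowsOfPartition is a hypothetical sequence of row vectors; element-wise + on in-core matrices is assumed):

    // sum of outer products x_i x_i^T over the rows of one partition
    val partial = rowsOfPartition
      .map(x => x cross x)  // outer product of each row with itself
      .reduce(_ + _)        // sum up the small n x n contributions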
Overview
• Apache Mahout: Past & Future
• A DSL for Machine Learning
• Example
• Under the covers
• Distributed computation of XᵀX
Physical operators for Transpose-Times-Self
• Two physical operators (concrete implementations) are available for the Transpose-Times-Self operation
  – standard operator "AtA"
  – operator "AtA_slim", a specialized implementation for "tall & skinny" matrices (many rows, few columns)
• The optimizer must choose
  – currently: depends on a user-defined threshold for the number of columns
  – ideally: a cost-based decision, dependent on estimates of intermediate result sizes
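The threshold rule amounts to something like the following sketch (maxSlimColumns is a made-up name for the user-defined threshold, and an ncol accessor analogous to nrow is assumed):

    // choose the physical operator by the column count of the operand
    val operator = if (drmX.ncol <= maxSlimColumns) "AtA_slim" else "AtA"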
Physical operator "AtA"

[Figure: X is row-partitioned into X₁ on worker 1 and X₂ on worker 2. Each worker computes the contribution of its rows to every row-partition of the result ("for 1st partition", "for 2nd partition"). The partial results are shuffled and summed (∑) on workers 3 and 4, which produce the row-partitions of XᵀX.]
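A hedged Spark sketch of the idea behind "AtA" (plain RDDs of dense rows, not Mahout's actual operator code): every row x contributes x(i)·x to row i of XᵀX, and the contributions are shuffled by result row and summed.

    import org.apache.spark.rdd.RDD

    // rows: the row-partitioned X as an RDD of dense arrays
    def ata(rows: RDD[Array[Double]]): RDD[(Int, Array[Double])] =
      rows.flatMap { x =>
        // row x contributes its outer product x x^T: x(i) * x goes to result row i
        x.indices.map(i => (i, x.map(_ * x(i))))
      }.reduceByKey { (a, b) =>
        a.zip(b).map { case (u, v) => u + v }  // sum the partial contributions per result row
      }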
Physical operator "AtA_slim"

[Figure: worker 1 computes the local product X₁ᵀX₁ of its partition, worker 2 computes X₂ᵀX₂; the driver sums the small partial results: XᵀX = X₁ᵀX₁ + X₂ᵀX₂]
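A hedged Spark sketch of "AtA_slim" (again not Mahout's code): each partition computes its local n×n Gram matrix in a single pass, and the small partial products are then summed, matching XᵀX = X₁ᵀX₁ + X₂ᵀX₂.

    import org.apache.spark.rdd.RDD

    // rows: the row-partitioned X; n: the number of columns (small for tall & skinny X)
    def ataSlim(rows: RDD[Array[Double]], n: Int): Array[Array[Double]] =
      rows.mapPartitions { it =>
        val acc = Array.ofDim[Double](n, n)  // local accumulator for X_k^T X_k
        for (x <- it; i <- 0 until n; j <- 0 until n)
          acc(i)(j) += x(i) * x(j)
        Iterator.single(acc)
      }.reduce { (a, b) =>
        Array.tabulate(n, n)((i, j) => a(i)(j) + b(i)(j))  // sum the n x n partials
      }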
Summary
• MapReduce is outdated as an abstraction for distributed machine learning
• R/Matlab-like DSL for the declarative implementation of algorithms
• automatic compilation, optimization and parallelization of programs written in this DSL
• execution on novel distributed engines like Apache Spark and Apache Flink
Thank you. Questions?

Tutorial for playing with the new Mahout DSL:
http://mahout.apache.org/users/sparkbindings/play-with-shell.html

Apache Flink Meetup in Berlin:
http://www.meetup.com/Apache-Flink-Meetup/

Follow me on Twitter: @sscdotopen