arXiv:1506.07552v1 [cs.LG] 24 Jun 2015

Splash: User-friendly Programming Interface for Parallelizing Stochastic Algorithms

Yuchen Zhang and Michael I. Jordan
Department of Electrical Engineering and Computer Science
University of California, Berkeley, CA 94720
{yuczhang,jordan}@eecs.berkeley.edu

Abstract

Stochastic algorithms are efficient approaches to solving machine learning and optimization problems. In this paper, we propose a general framework called Splash for parallelizing stochastic algorithms on multi-node distributed systems. Splash consists of a programming interface and an execution engine. Using the programming interface, the user develops sequential stochastic algorithms without worrying about any details of distributed computing. The algorithm is then automatically parallelized by a communication-efficient execution engine. We provide theoretical justification of the optimal rate of convergence for parallelizing stochastic gradient descent. Real-data experiments with stochastic gradient descent, collapsed Gibbs sampling, stochastic variational inference and stochastic collaborative filtering verify that Splash yields order-of-magnitude speedups over single-thread stochastic algorithms and over parallelized batch algorithms. Besides its efficiency, Splash provides a rich collection of interfaces for algorithm implementation. It is built on Apache Spark and is closely integrated with the Spark ecosystem.

1 Introduction

Stochastic optimization algorithms process a large-scale dataset by sequentially processing random subsamples. This processing scheme makes the per-iteration cost of the algorithm much cheaper than that of batch processing algorithms while still yielding effective descent. Indeed, for convex optimization, the efficiency of stochastic gradient descent (SGD) and its variants has been established both in theory and in practice [29, 3, 26, 5, 23, 12]. For non-convex optimization, stochastic methods achieve state-of-the-art performance on a broad class of problems, including matrix factorization [13], neural networks [14] and representation learning [25]. Stochastic algorithms are also widely used in the Bayesian setting for finding approximations to posterior distributions; examples include Markov chain Monte Carlo, expectation propagation [18] and stochastic variational inference [10].

Although classical stochastic approximation procedures are sequential, it is clear that they also present opportunities for parallel and distributed implementations that may yield significant additional speedups. One active line of research studies asynchronous parallel updating schemes in the setting of a lock-free shared memory [21, 6, 17, 31, 9, 27]. When the time delay of concurrent updates is bounded, it is known that such updates preserve statistical correctness [1, 17]. Such asynchronous algorithms yield significant speedups on multi-core machines. On distributed systems connected by commodity networks, however, the communication requirements of such algorithms can be overly expensive. If messages are frequently exchanged across the network, the communication cost will easily dominate the computation cost. There has also been a flurry of research studying the implementation of stochastic algorithms in the fully distributed setting [32, 30, 19, 7, 16, 11]. Although promising results have been reported, the implementations proposed to date have their limitations: they have been designed for specific algorithms, or they require careful partitioning of the data to avoid inconsistency.

In this paper, we propose a general framework for parallelizing stochastic algorithms on multi-node distributed systems. Our framework is called Splash (System for Parallelizing Learning Algorithms with Stochastic Methods). Splash consists of a programming interface and an execution engine. Using the programming interface, the user develops sequential stochastic algorithms without thinking about issues of distributed computing. The algorithm is then automatically parallelized by the execution engine. The parallelization is communication efficient, meaning that its separate threads don't communicate with each other until all of them have processed a large bulk of data. Thus, the inter-node communication need not be a performance bottleneck. To ensure parallelizability, we ask the user to implement a slightly stronger version of the base sequential algorithm: it needs to be capable of processing weighted samples. Many stochastic algorithms can be generalized to processing weighted samples without sacrificing computational efficiency. Since the processing of a weighted sample can be carried out within a sequential paradigm, this requirement does not force the user to think about a distributed implementation.

Splash converts a distributed processing task into a sequential processing task using distributed versions of averaging and reweighting. During the execution of the algorithm, we let every thread sequentially process its local data. The local updates are iteratively averaged to construct the global update. Critically, however, although averaging reduces the variance of the local updates, it doesn't reduce their bias. In contrast to the sequential case in which a thread processes a full sequence of random samples, in the distributed setting every individual thread touches only a small subset of samples, resulting in a significant bias relative to the full update. Our reweighting scheme addresses this problem by feeding the algorithm with weighted samples, ensuring that the total weight processed by each thread is equal to the number of samples in the full sequence. This helps individual threads to generate nearly-unbiased estimates of the full update. Using this approach, Splash automatically detects the best degree of parallelism for the algorithm.

Theoretically, we prove that Splash achieves the optimal rate of convergence for parallelizing SGD, assuming that the objective function is smooth and strongly convex. We conduct extensive experiments on a variety of stochastic algorithms, including algorithms for logistic regression, topic modeling and personalized recommendation. The experiments verify that Splash can yield orders-of-magnitude speedups over single-thread stochastic algorithms and over parallelized batch algorithms.
Besides its performance, Splash is a contribution on the distributed computing systems front, providing a flexible interface for the implementation of stochastic algorithms. We build Splash on top of Apache Spark [28], a popular distributed data-processing framework for batch algorithms. Splash takes the standard Resilient Distributed Dataset (RDD) of Spark as input and generates an RDD as output. The data structure also supports default RDD operators such as Map and Reduce, ensuring convenient interaction with Spark. Because of this integration, Splash works seamlessly with other data analytics tools in the Spark ecosystem, enabling a single system to address the entire analytics pipeline.

2 Background

In this paper, we focus on stochastic algorithms which take the following general form. At step t, the algorithm receives a data element z_t and a vector of shared variables v_t. Based on these values, the algorithm performs an incremental update ∆(z_t, v_t) on the shared variable:

v_{t+1} ← v_t + ∆(z_t, v_t).    (1)

For example, stochastic gradient descent (SGD) fits this general framework. Letting x denote a random data element and letting w denote the parameter vector, SGD performs the update

t ← t + 1   and   w ← w − η_t ∇ℓ(w; x),    (2)

where ℓ(·; x) is the loss function associated with the element and η_t is the stepsize at time t. In this case both w and t are shared variables. In Appendix A, we discuss additional stochastic algorithms and their implementations via Splash, including collapsed Gibbs sampling, stochastic variational inference and a stochastic algorithm for collaborative filtering.

Several stochastic algorithms use local variables in their computation. Each local variable is associated with a specific data element. For example, the collapsed Gibbs sampling algorithm for LDA [8] maintains a topic assignment for each word, which is stored as a local variable. The stochastic dual coordinate ascent (SDCA) algorithm [24] maintains a dual variable for each data element, which is also stored as a local variable. To support the development of such algorithms, Splash allows creating and updating local variables during the algorithm execution. Nevertheless, since the system carries out automatic reweighting and rescaling (see Section 4.2), improper usage of local variables may cause weight-mismatch issues. The system therefore provides a user-friendly interface called the "delayed operator" which substitutes for the functionality of local variables in many situations. In particular, the user can declare a variable operation and suspend its execution until the next time the same element is processed. Both collapsed Gibbs sampling and SDCA can be implemented via a delayed operation; see Appendix A for a concrete example. The consistency of this operation is always guaranteed.

Shared variables and local variables are stored separately. In particular, shared variables are replicated on every data partition and their values are iteratively synchronized. The local variables, in contrast, are stored with the associated data elements and are never synchronized. This storage scheme optimizes the communication efficiency and allows for convenient element-wise operations.

3 Programming with Splash

Splash allows the user to write self-contained Scala applications using its programming interface. The goal of the programming interface is to make distributed computing transparent to the user. Splash extends Apache Spark to provide an abstraction called a Parametrized RDD for storing and maintaining the distributed dataset. The Parametrized RDD is based on the Resilient Distributed Dataset (RDD) [28] used by Apache Spark. It can be created from a standard RDD object: val paramRdd = new ParametrizedRDD(rdd). We provide a rich collection of interfaces to convert the components of a Parametrized RDD to standard RDDs, facilitating the interaction between Splash and Spark.

To run algorithms on the Parametrized RDD, the user creates a data processing function called process which implements the stochastic algorithm, then calls the method paramRdd.run(process) to start running the algorithm. In the default setting, the execution engine takes a full pass over the dataset by calling run() once. This is called one iteration of the algorithm execution. The inter-node communication occurs only at the end of the iteration. The user may call run() multiple times to take multiple passes over the dataset. The process function is implemented using the following format:

def process(elem : Any, weight : Int, sharedVar : VarSet, localVar : VarSet){. . . }

It takes four arguments as input: a single element elem, the weight of the element, the shared variable sharedVar and the local variable localVar associated with the element. The goal is to update sharedVar and localVar according to the input. Splash provides multiple ways to manipulate these variables. Both local and shared variables are manipulated as key-value pairs. The key must be a string; the value can be either a real number or an array of real numbers. Inside the process implementation, the value of local or shared variables can be accessed by localVar.get(key) or sharedVar.get(key). The local variable can be updated by setting a new value for it: localVar.set(key, value). The shared variable is updated by operators. For example, using the add operator, the expression sharedVar.add(key, delta) adds a scalar delta to the variable. The SGD updates (2) can be implemented via several add operators. Other operators supported by the programming interface, including delayed add and multiply, are introduced in Section 4.2. Similar to the standard RDD, the user can perform map and reduce operations directly on the Parametrized RDD. For example, after the algorithm terminates, the expression val loss = paramRdd.map(evalLoss).sum() evaluates the element-wise losses and aggregates them across the dataset.
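To make the interface concrete, the following is a minimal sketch (not taken from the Splash distribution) of how the SGD update (2) for logistic regression might be written against this interface. ParametrizedRDD, VarSet, run, get, add and map/sum are the calls described in this section; the key names, the per-coordinate storage of w, the feature dimension, the file path and the evalLoss signature are illustrative assumptions.

```scala
import org.apache.spark.{SparkConf, SparkContext}

object SgdOnSplashSketch {
  case class Sample(label: Double, features: Array[Double])
  val dim = 100 // assumed feature dimension

  // Implements the SGD update (2) using only add operators; each coordinate
  // of w is stored under its own key "w:i" (an illustrative layout choice).
  def process(elem: Any, weight: Int, sharedVar: VarSet, localVar: VarSet) {
    val x = elem.asInstanceOf[Sample]
    val t = sharedVar.get("t")                                // assumed to return a Double
    val w = Array.tabulate(dim)(i => sharedVar.get("w:" + i))
    val eta = 1.0 / math.sqrt(t + 1.0)                        // illustrative stepsize schedule
    val margin = x.label * (w, x.features).zipped.map(_ * _).sum
    val coeff = -x.label / (1.0 + math.exp(margin))           // logistic-loss gradient factor
    sharedVar.add("t", 1.0)                                   // t <- t + 1
    for (i <- 0 until dim)
      sharedVar.add("w:" + i, -eta * coeff * x.features(i))   // w <- w - eta * gradient
  }

  // Element-wise loss, assuming map passes (element, sharedVar) to the function.
  def evalLoss(x: Sample, sharedVar: VarSet): Double = {
    val w = Array.tabulate(dim)(i => sharedVar.get("w:" + i))
    math.log(1.0 + math.exp(-x.label * (w, x.features).zipped.map(_ * _).sum))
  }

  def main(args: Array[String]) {
    val sc = new SparkContext(new SparkConf().setAppName("splash-sgd-sketch"))
    val rdd = sc.objectFile[Sample]("hdfs:///path/to/samples")  // placeholder path
    val paramRdd = new ParametrizedRDD(rdd)
    for (pass <- 1 to 30) paramRdd.run(process)                 // 30 passes, as in Section 6
    println("loss = " + paramRdd.map(evalLoss).sum())
  }
}
```

The weight argument is ignored here; Section 4.2 describes how a weighted version of the same update is written.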

4 Strategy for Parallelization

In this section, we first discuss two naive strategies for parallelizing a stochastic algorithm and their respective limitations. These limitations motivate the strategy that Splash employs.

4.1 Two naive strategies

We denote by ∆(S) the incremental update on variable v after processing the set of samples S. Suppose that there are m threads and each thread processes a subset S_i of S. If the i-th thread increments the shared variable by ∆(S_i), then the accumulation scheme constructs a global update by accumulating the local updates:

v_new = v_old + Σ_{i=1}^m ∆(S_i).    (3)

The scheme (3) provides a good approximation to the full update if the batch size |S_i| is sufficiently small [1]. However, frequent communication is necessary to ensure a small batch size. For distributed systems connected by commodity networks, frequent communication is prohibitively expensive, even if the communication is asynchronous.

Applying scheme (3) on a large batch may easily lead to divergence. Taking SGD as an example: if all threads start from the same vector w_old, then after processing a large batch, the new vector on each thread will be close to the optimal solution w*. If the variable is updated by formula (3), then we have

w_new − w* = w_old − w* + Σ_{i=1}^m ∆(S_i) ≈ w_old − w* + Σ_{i=1}^m (w* − w_old) = (m − 1)(w* − w_old).

Clearly SGD will diverge if m ≥ 3. One way to avoid divergence is to multiply the incremental change by a small coefficient. When the coefficient is 1/m, the variable is updated by

v_new = v_old + (1/m) Σ_{i=1}^m ∆(S_i).    (4)

This averaging scheme usually avoids divergence. However, since the local updates are computed on 1/m-th of S, they make little progress compared to the full sequential update. Thus the algorithm converges substantially slower than its sequential counterpart after processing the same amount of data. See Appendix D for evidence of this phenomenon on simulated data and Appendix H.5 for evidence on real data.

4.2 Our strategy

We now describe the Splash strategy for combining parallel updates. First we introduce the operators that Splash supports for manipulating shared variables. Then we illustrate how conflicting updates are combined by the reweighting scheme.

Operators  The programming interface allows the user to manipulate shared variables inside their algorithm implementation via operators. An operator is a function that maps a real number to another real number. Splash supports three types of operators: add, delayed add and multiply. The system employs different strategies for parallelizing different types of operators.

The add operator is the most commonly used operator. When the operation is performed on variable v, the variable is updated by v ← v + δ, where δ is a user-specified scalar. The SGD update (2) can be implemented using this operator.

The delayed add operator performs the same mapping v ← v + δ; however, the operation is not executed until the next time that the same element is processed by the system. Delayed operations are useful in implementing sampling-based stochastic algorithms. In particular, before a new value is sampled, the old value should be removed. This "reverse" operation can be declared as a delayed operator when the old value is sampled, and executed before the new value is sampled. See Appendix A for a concrete example with the collapsed Gibbs sampling implementation.

The multiply operator scales the variable by v ← γ · v, where γ is a user-specified scalar. The multiply operator is especially efficient for scaling high-dimensional arrays: the array multiplication costs O(1) computation time, independent of the dimension of the array. See Appendix A for a more detailed discussion of the multiply operator.
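For illustration, a process function might mix the three operator types as in the sketch below. The method names delayedAdd and multiply are assumed here for the delayed add and multiply operators, since this section does not spell out their exact names, and the keys are hypothetical.

```scala
// Sketch only: demonstrates the three operator types on shared variables.
def process(elem: Any, weight: Int, sharedVar: VarSet, localVar: VarSet) {
  val word = elem.asInstanceOf[String]
  sharedVar.add("count:" + word, 1.0)          // add: applied immediately
  sharedVar.delayedAdd("count:" + word, -1.0)  // delayed add: runs the next time this element is processed
  sharedVar.multiply("topicWeights", 0.99)     // multiply: O(1) scaling of an array variable
}
```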


Reweighting  Suppose that there are m threads running in parallel. Note that all Splash operators are linear transformations. When these operators are applied sequentially, they merge into a single linear transformation. Taking a variable v, we write thread i's transformation of v as

v ← Γ(S_i) · v + ∆(S_i) + T(S_i),    (5)

where S_i is the sequence of samples processed by thread i, Γ(S_i) is the scale factor resulting from the multiply operators, ∆(S_i) is the term resulting from the add operators, and T(S_i) is the term resulting from the delayed add operators executed in the current iteration. A detailed discussion of the construction of these transformations is given in Appendix B.

As discussed in Section 4.1, directly combining these transformations leads to divergence or slow convergence (or both). The reweighting scheme addresses this dilemma by assigning weights to the samples. Since the update (5) is constructed on a fraction 1/m of the full sequence S, we reweight every element in the local sequence by m. After reweighting, the data distribution of S_i approximates the data distribution of S, making update (5) a nearly-unbiased estimate of the full sequential update. More concretely, the algorithm manipulates the variable by taking the sample weights into account: an m-weighted sample tells the algorithm that it appears m times consecutively in the sequence. We rename the transformations Γ(mS_i), ∆(mS_i) and T(mS_i), emphasizing that they are constructed by processing m-weighted samples. Then we redefine the transformation of thread i by

v ← Γ(mS_i) · v + ∆(mS_i) + T(mS_i)    (6)

and define the global update by

v_new = (1/m) Σ_{i=1}^m ( Γ(mS_i) · v_old + ∆(mS_i) ) + Σ_{i=1}^m T(mS_i).    (7)

Equation (7) combines the transformations of all threads. The terms Γ(mS_i) and ∆(mS_i) are scaled by a factor 1/m because they were constructed on m times the amount of data. The term T(mS_i) is not scaled, because the delayed operators it contains were declared in earlier iterations, independent of the reweighting. Finally, the coefficient 1/m should multiply all delayed operators declared in the current iteration, since these delayed operators were constructed on m times the amount of data. Appendix D presents a concrete example illustrating the consequence of reweighting.

Determining the degree of parallelism  To determine the thread number m, the execution engine partitions the available cores into different-sized groups. Suppose that group i contains m_i cores. These cores execute the algorithm tentatively on m_i parallel threads. The best thread number is then determined by cross-validation and is dynamically updated. See Appendix E for a detailed description. To find the best degree of parallelism, the base algorithm needs to be robust to processing a wide range of sample weights.

Generalizing stochastic algorithms  Many stochastic algorithms can be generalized to processing weighted samples without sacrificing computational efficiency. The most straightforward generalization is to repeat the single-element update m times. For example, we can generalize the SGD update (2) by

t ← t + m   and   w ← w − m η_t ∇ℓ(w; x),    (8)

assuming that the sample weight is m. In Appendix C, we demonstrate how to generalize the collapsed Gibbs sampling method, the stochastic variational inference method and the stochastic collaborative filtering algorithm to processing weighted data.
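A sketch of how the weighted update (8) could look inside a process function follows; the weight argument is the sample weight m supplied by the execution engine, while lambda, the gradient placeholder and the key layout are illustrative assumptions rather than part of Splash.

```scala
// Sketch of the weighted SGD update (8): the element is treated as if it
// appeared `weight` times consecutively in the sequence.
val lambda = 1e-2                                        // placeholder regularization constant

def gradient(elem: Any, sharedVar: VarSet): Array[Double] = ??? // placeholder: loss gradient at w

def process(elem: Any, weight: Int, sharedVar: VarSet, localVar: VarSet) {
  val t = sharedVar.get("t")
  val eta = 2.0 / (lambda * (t + 1.0))                   // stepsize eta_t = 2/(lambda*t), as in Theorem 2
  val g = gradient(elem, sharedVar)
  sharedVar.add("t", weight)                             // t <- t + m
  for (i <- g.indices)
    sharedVar.add("w:" + i, -weight * eta * g(i))        // w <- w - m * eta_t * gradient
}
```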

5 Convergence Analysis

In this section, we study the convergence of SGD when it is parallelized by Splash. The goal of SGD is to minimize the empirical risk function

L(w) = (1/|S|) Σ_{x∈S} ℓ(w; x),

where S is a fixed dataset and w ∈ R^d is the vector to be minimized over. Suppose that there are m threads running in parallel. At every iteration, thread i randomly draws (with replacement) a subset of samples S_i of length n from the dataset S. The thread sequentially processes S_i by SGD. The per-iteration update is

t ← t + m   and   w ← w + ( Π_W(w − m η_t ∇ℓ(w; x)) − w ),    (9)

where the sample weight is equal to m. We have generalized the update (8) by introducing Π_W(·), the projector onto a feasible set W of the vector space. Projecting onto the feasible set is a standard post-processing step for an SGD iterate. At the end of the iteration, updates are synchronized by equation (7). This is equivalent to computing

t_new = t_old + mn   and   w_new = w_old + (1/m) Σ_{i=1}^m ∆(mS_i).    (10)

We denote by w* := arg min_{w∈W} L(w) the minimizer of the objective function, and denote by w^T the combined vector after the T-th iteration.

General convex function  For general convex functions, we start by introducing three additional terms. Let w^k_{i,j} be the value of vector w at iteration k when thread i is processing the j-th element of S_i. Let η^k_{i,j} be the stepsize associated with that update. We define the weighted average vector

w̄^T = ( Σ_{k=1}^T Σ_{i=1}^m Σ_{j=1}^n η^k_{i,j} w^k_{i,j} ) / ( Σ_{k=1}^T Σ_{i=1}^m Σ_{j=1}^n η^k_{i,j} ).

Note that w̄^T can be computed together with w^T. For general convex L, the function value L(w̄^T) converges to L(w*). See Appendix F for the proof.

Theorem 1. Assume that ‖∇ℓ(w; x)‖_2 is bounded for all (w, x) ∈ W × S. Also assume that η_t is a monotonically decreasing function of t such that Σ_{t=1}^∞ η_t = ∞ and Σ_{t=1}^∞ η_t^2 < ∞. Then we have

lim_{T→∞} E[L(w̄^T) − L(w*)] = 0.

Smooth and strongly convex function  We now turn to the study of smooth and strongly convex objective functions. We make three assumptions on the objective function. Assumption A restricts the optimization problem to a bounded convex set. Assumptions B and C require the objective function to be sufficiently smooth and strongly convex on that set.


Assumption A. The feasible set W ⊂ R^d is a compact convex set of finite diameter R. Moreover, w* is an interior point of W; i.e., there is a set U_ρ := {w ∈ R^d : ‖w − w*‖_2 < ρ} such that U_ρ ⊂ W.

Assumption B. There are finite constants L, G and H such that ‖∇²ℓ(w; x) − ∇²ℓ(w*; x)‖_2 ≤ L‖w − w*‖_2, ‖∇ℓ(w; x)‖_2 ≤ G and ‖∇²ℓ(w; x)‖_2 ≤ H for all (w, x) ∈ W × S.

Assumption C. The objective function L is λ-strongly convex over the space W, meaning that ∇²L(w) ⪰ λ I_{d×d} for all w ∈ W.

As a preprocessing step, we construct a Euclidean ball B of diameter D := λ/(4(L + G/ρ²)) which contains the optimal solution w*. The ball center can be found by running the sequential SGD for a constant number of steps. During the Splash execution, if the combined vector w^T ∉ B, then we project it onto B, ensuring that the distance between w^T and w* is bounded by D. Introducing this projection step simplifies the theoretical analysis, but it may not be necessary in practice.

Under these assumptions, we provide an upper bound on the mean-squared error of w^T. The following theorem shows that the mean-squared error decays as 1/(Tmn), inversely proportional to the total number of processed samples. This is the optimal rate of convergence among all optimization algorithms which rely on noisy gradients [20]. See Appendix G for the proof.

Theorem 2. Under Assumptions A-C, if we choose the stepsize η_t = 2/(λt), then the output w^T has mean-squared error

E[‖w^T − w*‖_2^2] ≤ 4G²/(λ² Tmn) + C_1/(T m^{1/2} n^{3/2}) + C_2/(T n²),    (11)

where C_1 and C_2 are constants independent of T, m and n.

When the local sample size n is sufficiently larger than the thread number m (which is typically true), the last two terms on the right-hand side of bound (11) are negligibly small. Thus, the mean-squared error is dominated by the first term, which scales as 1/(T mn).

6 Experiments

In this section, we report the empirical performance of Splash. We solve three machine learning problems: logistic regression, topic modelling and movie recommendation. Our implementation of Splash runs on an Amazon EC2 cluster with eight nodes. Each node is powered by an eight-core Intel Xeon E5-2665 with 30GB of memory and is connected to a commodity 1GB network, so that the cluster contains 64 cores. For all applications, the system chooses 64 threads as the best degree of parallelism.

Datasets  For logistic regression, we use the Covtype, RCV1 and MNIST 8M datasets from the LIBSVM Data website [4]. Among the three datasets, Covtype and MNIST have dense features, while RCV1 has sparse features. For topic modelling, we use the NIPS, Enron and NYTimes datasets from the UCI Machine Learning Repository [15], which are represented by bag-of-words histograms. For movie recommendation, we use the Netflix movie rental dataset. This dataset contains 100M movie ratings made by 480K users on 17K movies.


[Figure 1: bar charts of running time (sec). Panels: (a) Stochastic Gradient Descent on Covtype, RCV1 and MNIST 8M, comparing SGD (1 thread) with Splash + SGD (64 threads); (b) Collapsed Gibbs Sampling on NIPS, Enron and NYTimes, comparing Gibbs sampling (1 thread) with Splash + Gibbs (64 threads); (c) Stochastic Variational Inference on NIPS, Enron and NYTimes, comparing SVI (1 thread), batch VI (64 threads) and Splash + SVI (64 threads); (d) Stochastic Collaborative Filtering on Netflix at AUC = 0.91 and AUC = 0.94, comparing the stochastic algorithm (1 thread), the batch algorithm (64 threads) and Splash + stochastic (64 threads).]

Figure 1: Running time of different algorithms for achieving a fixed accuracy. The accuracies are measured by (a) the logistic loss achieved by SGD taking 30 passes over the dataset; (b) the perplexity score achieved by collapsed Gibbs sampling taking 100 passes over NIPS and Enron, or taking 15 passes over NYTimes; (c) the predictive log-likelihood achieved by the variational inference algorithm taking 15 passes over the dataset; (d) a lower AUC score (AUC = 0.91) and a higher AUC score (AUC = 0.94).

Algorithms  For logistic regression, we compare batch gradient descent (GD), stochastic gradient descent (SGD) and the parallel version of SGD under Splash. For learning topic models, we consider two classes of algorithms. The first class are sampling-based algorithms, where we compare collapsed Gibbs sampling [8] and its parallel version under Splash. The second class are variational-inference-based algorithms, where we compare batch variational inference (VI) [2], stochastic variational inference (SVI) [10] and the parallel version of SVI under Splash. For movie recommendation, we compare the batch algorithm based on alternating loss minimization, the stochastic algorithm called Bayesian Personalized Ranking (BPR) [22] and the parallel version of BPR under Splash.

Evaluation Metrics  To evaluate the performance of the algorithms, we plot the running time of different algorithms for achieving the same learning accuracy. For logistic regression, the accuracy is measured by the logistic loss. For sampling-based algorithms, the accuracy is measured by the perplexity score on the held-out data. For variational-inference-based algorithms, the accuracy is measured by the predictive log-likelihood. For movie recommendation, the accuracy is measured by the Area Under Curve (AUC) score on the test set.

[Figure 2: bar chart of the parallel/sequential time ratio for Covtype (SGD), RCV1 (SGD), MNIST 8M (SGD), NIPS (Gibbs), Enron (Gibbs), NYTimes (Gibbs), NIPS (SVI), Enron (SVI), NYTimes (SVI) and Netflix (BPR).]

Figure 2: The parallel/sequential time ratio for each thread processing the same amount of data. The ratio is computed between the Splash implementation and the single-thread stochastic algorithm. The Splash running time includes the communication overhead and the time delay due to the synchronization.

Experimental results  Figure 1 compares the running time of different algorithms for achieving the same learning accuracy. For logistic regression, the single-thread SGD converges substantially faster than the 64-thread batch GD (see Appendix H.1 for the plot). Figure 1(a) shows that Splash yields 16x, 37x and 28x speedups over the single-thread SGD, and is thus orders of magnitude faster than the batch GD. To the best of our knowledge, no existing machine learning system achieves the same efficiency in parallelizing SGD on commodity multi-node clusters. For learning topic models, Figure 1(b) shows that Splash yields 38x, 149x and 30x speedups over the single-thread collapsed Gibbs sampling algorithm. Interestingly, the speedup rate is greater than 64x on the Enron dataset, suggesting that parallelization accelerates the convergence. For parallelizing stochastic variational inference (SVI), the speedup rates are 19x, 20x and 16x. Figure 1(c) shows that Splash is also faster than the 64-thread VI algorithm. In particular, Splash is 3.2x, 5.4x and 17.5x faster than the 64-thread VI on the three datasets, and the speedup rate increases as the dataset gets bigger. For movie recommendation, Figure 1(d) shows that the 64-thread batch algorithm converges faster than the single-thread stochastic algorithm. The sparsity of the Netflix dataset helps the batch algorithm to achieve good performance. Nevertheless, if we parallelize the stochastic algorithm via Splash, then we get further speedups. Depending on the desired accuracy, Splash is 12x-40x faster than the single-thread stochastic algorithm, and 3x-5.5x faster than the batch algorithm.

Figure 2 plots the parallel/sequential time ratio for a thread processing the same amount of data. This ratio is computed between the running time of Splash and the running time of the single-thread stochastic algorithm. On most datasets, the time delay due to the parallelization is not significant. The exceptions are the collapsed Gibbs sampling algorithm on all datasets and the SVI algorithm on the Enron dataset. Indeed, the topic model has a large collection of parameters to synchronize, which increases the communication cost. Since the Enron dataset consists of short emails, processing this dataset requires less computation, further increasing the ratio of communication overhead.

We report more experiments in Appendix H, including the implementation details, the convergence plots for all algorithms that we have tested and a detailed discussion of the relative performance. We also compare Splash with other parallelization schemes. Overall, our experimental results show that Splash outperforms the parallelization schemes that we tested in terms of efficiency and robustness.

References

[1] A. Agarwal and J. C. Duchi. Distributed delayed stochastic optimization. In NIPS, pages 873-881, 2011.
[2] D. M. Blei, A. Y. Ng, and M. I. Jordan. Latent Dirichlet allocation. Journal of Machine Learning Research, 3:993-1022, 2003.
[3] L. Bottou. Large-scale machine learning with stochastic gradient descent. In Proceedings of COMPSTAT'2010, pages 177-186. Springer, 2010.
[4] C.-C. Chang and C.-J. Lin. LIBSVM: a library for support vector machines. ACM Transactions on Intelligent Systems and Technology (TIST), 2(3):27, 2011.
[5] J. Duchi, E. Hazan, and Y. Singer. Adaptive subgradient methods for online learning and stochastic optimization. The Journal of Machine Learning Research, 12:2121-2159, 2011.
[6] J. C. Duchi, A. Agarwal, and M. J. Wainwright. Dual averaging for distributed optimization: convergence analysis and network scaling. IEEE Transactions on Automatic Control, 57(3):592-606, 2012.
[7] R. Gemulla, E. Nijkamp, P. J. Haas, and Y. Sismanis. Large-scale matrix factorization with distributed stochastic gradient descent. In SIGKDD, pages 69-77. ACM, 2011.
[8] T. L. Griffiths and M. Steyvers. Finding scientific topics. Proceedings of the National Academy of Sciences, 101(suppl 1):5228-5235, 2004.
[9] Q. Ho, J. Cipar, H. Cui, S. Lee, J. K. Kim, P. B. Gibbons, G. A. Gibson, G. Ganger, and E. P. Xing. More effective distributed ML via a stale synchronous parallel parameter server. In NIPS, pages 1223-1231, 2013.
[10] M. D. Hoffman, D. M. Blei, C. Wang, and J. Paisley. Stochastic variational inference. The Journal of Machine Learning Research, 14(1):1303-1347, 2013.
[11] M. Jaggi, V. Smith, M. Takáč, J. Terhorst, S. Krishnan, T. Hofmann, and M. I. Jordan. Communication-efficient distributed dual coordinate ascent. In NIPS, pages 3068-3076, 2014.
[12] R. Johnson and T. Zhang. Accelerating stochastic gradient descent using predictive variance reduction. In NIPS, pages 315-323, 2013.
[13] Y. Koren, R. Bell, and C. Volinsky. Matrix factorization techniques for recommender systems. Computer, (8):30-37, 2009.
[14] A. Krizhevsky, I. Sutskever, and G. E. Hinton. ImageNet classification with deep convolutional neural networks. In NIPS, pages 1097-1105, 2012.
[15] M. Lichman. UCI machine learning repository, 2013.
[16] C. Liu, H.-c. Yang, J. Fan, L.-W. He, and Y.-M. Wang. Distributed nonnegative matrix factorization for web-scale dyadic data analysis on MapReduce. In WWW, pages 681-690. ACM, 2010.
[17] J. Liu, S. J. Wright, C. Ré, V. Bittorf, and S. Sridhar. An asynchronous parallel stochastic coordinate descent algorithm. arXiv preprint arXiv:1311.1873, 2013.
[18] T. P. Minka. Expectation propagation for approximate Bayesian inference. In UAI, pages 362-369. Morgan Kaufmann Publishers Inc., 2001.
[19] D. Newman, P. Smyth, M. Welling, and A. U. Asuncion. Distributed inference for latent Dirichlet allocation. In NIPS, pages 1081-1088, 2007.
[20] A. Rakhlin, O. Shamir, and K. Sridharan. Making gradient descent optimal for strongly convex stochastic optimization. arXiv preprint arXiv:1109.5647, 2011.
[21] B. Recht, C. Re, S. Wright, and F. Niu. Hogwild: A lock-free approach to parallelizing stochastic gradient descent. In NIPS, pages 693-701, 2011.
[22] S. Rendle, C. Freudenthaler, Z. Gantner, and L. Schmidt-Thieme. BPR: Bayesian personalized ranking from implicit feedback. In UAI, pages 452-461. AUAI Press, 2009.
[23] M. Schmidt, N. L. Roux, and F. Bach. Minimizing finite sums with the stochastic average gradient. arXiv preprint arXiv:1309.2388, 2013.
[24] S. Shalev-Shwartz and T. Zhang. Stochastic dual coordinate ascent methods for regularized loss. The Journal of Machine Learning Research, 14(1):567-599, 2013.
[25] P. Vincent, H. Larochelle, Y. Bengio, and P.-A. Manzagol. Extracting and composing robust features with denoising autoencoders. In ICML, pages 1096-1103. ACM, 2008.
[26] L. Xiao. Dual averaging method for regularized stochastic learning and online optimization. In NIPS, pages 2116-2124, 2009.
[27] E. P. Xing, Q. Ho, W. Dai, J. K. Kim, J. Wei, S. Lee, X. Zheng, P. Xie, A. Kumar, and Y. Yu. Petuum: A new platform for distributed machine learning on big data. arXiv preprint arXiv:1312.7651, 2013.
[28] M. Zaharia, M. Chowdhury, T. Das, A. Dave, J. Ma, M. McCauley, M. J. Franklin, S. Shenker, and I. Stoica. Resilient distributed datasets: A fault-tolerant abstraction for in-memory cluster computing. In NSDI. USENIX Association, 2012.
[29] T. Zhang. Solving large scale linear prediction problems using stochastic gradient descent algorithms. In ICML, page 116. ACM, 2004.
[30] Y. Zhang, M. J. Wainwright, and J. C. Duchi. Communication-efficient algorithms for statistical optimization. In NIPS, pages 1502-1510, 2012.
[31] Y. Zhuang, W.-S. Chin, Y.-C. Juan, and C.-J. Lin. A fast parallel SGD for matrix factorization in shared memory systems. In RecSys, pages 249-256. ACM, 2013.
[32] M. Zinkevich, M. Weimer, L. Li, and A. J. Smola. Parallelized stochastic gradient descent. In NIPS, pages 2595-2603, 2010.


A More Examples of Stochastic Algorithms

In this section, we describe three additional stochastic algorithms: the collapsed Gibbs sampling algorithm, the stochastic variational inference algorithm and the stochastic collaborative filtering algorithm.

A.1 Collapsed Gibbs Sampling

Latent Dirichlet Allocation (LDA) [2] is an unsupervised model for learning topics from documents. The goal of LDA is to infer the topic of each word in each document. The collapsed Gibbs sampling algorithm for LDA iteratively takes a word w from a document d and samples the topic of w by

P(topic = k | d, w) ∝ (n_{k|d} + α)(n_{w|k} + β) / (n_k + βW).    (12)

Here, W is the size of the vocabulary; n_{k|d} is the number of words in document d that have been assigned to topic k; n_{w|k} is the total number of times that word w is assigned to topic k; and n_k := Σ_w n_{w|k}. All counts exclude the current word w. The constants α and β are hyper-parameters of the LDA model. The counts n_{k|d}, n_{w|k} and n_k are shared variables. Suppose that the topic of the word has been changed from k to k′; then the algorithm performs the incremental updates

n_{k|d} ← n_{k|d} − 1,   n_{w|k} ← n_{w|k} − 1,   n_k ← n_k − 1,    (13)
n_{k′|d} ← n_{k′|d} + 1,   n_{w|k′} ← n_{w|k′} + 1,   n_{k′} ← n_{k′} + 1.    (14)

The algorithm terminates when the shared variables converge.

The algorithm terminates when the shared variables converge. Splash implementation The collapsed Gibbs sampling algorithm can be implemented via several add operations. Note that the update (13) reverses the word count increment that has been performed earlier. Thus, it can be implemented via the delayed add operator. More precisely, whenever the word count is incremented by one, the stochastic algorithm declares a delayed operator which decreases the word count by one. This operator will be executed at the next time when the same word is processed.

A.2 Stochastic Variational Inference

Stochastic variational inference (SVI) is another efficient approach to learning the LDA model [10]. SVI models a word w in document d as follows: the probability that the word belongs to topic k is proportional to a parameter γ_{dk}; the probability that topic k generates word w is proportional to another parameter λ_{kw}. At iteration t, a document d is randomly drawn from the corpus, and the parameters γ_{dk} and λ_{kw} are recomputed using variational inference (see, e.g., [10, Figure 6] for the algorithm details). Let the recomputed parameters be denoted by γ̃_{dk} and λ̃_{kw}; SVI performs the update

γ_{dk} ← γ̃_{dk},   λ_{kw} ← (1 − ρ_t) λ_{kw} + ρ_t D λ̃_{kw}   and   t ← t + 1,    (15)

where D is the total number of documents. The coefficient ρ_t is a user-specified learning rate decreasing with time t. For SVI, λ_{kw} and t are shared variables. The document-topic parameter γ_{dk} is not a shared variable since its value won't be used by other documents. It is not necessary to store γ_{dk}, since its value will be recomputed once the same document is processed in the next iteration. Nevertheless, storing it to initialize the local variational inference will make it converge faster, thus improving the algorithm's efficiency.

Splash implementation  Splash provides both the add operator and the multiply operator for updating shared variables (see Section 4.2). We use the SVI update (15) as an example to illustrate how a combination of the add operator and the multiply operator can be useful. The update (15) on λ_{kw} can be written as

Multiply:   λ_{kw} ← γ · λ_{kw}   where γ = 1 − ρ_t, and    (16)
Add:        λ_{kw} ← λ_{kw} + δ   where δ = ρ_t D λ̃_{kw}.    (17)

Equivalently, we can implement the update (15) by a single add operator:

λ_{kw} ← λ_{kw} + δ   where δ = ρ_t (D λ̃_{kw} − λ_{kw}).    (18)

If the variables {λ_{kw}} are stored in an array named λ, then step (16) can be declared as an array multiplication λ ← γ · λ. This operator has O(1) computation cost in Splash, independent of the dimension of the array. The computation cost of step (17) is proportional to the number of non-zero entries of {λ̃_{kw}}. In contrast, the computation cost of update (18) depends on the dimension of λ, which might be much greater than that of steps (16)-(17). The user is thus encouraged to use the multiply operator to scale high-dimensional arrays.
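Steps (16)-(17) might look as follows inside the user's code. The multiply(key, gamma) and element-wise add(key, index, delta) signatures are assumed here for the array-valued operators (the actual Splash API may differ), and lambdaTilde stands for the sparse recomputed parameters λ̃_{kw} of the sampled document.

```scala
// Sketch of update (15) as steps (16)-(17). "lambda:k" is assumed to hold
// the array (lambda_{k1}, ..., lambda_{kW}) for topic k.
def updateTopic(k: Int, lambdaTilde: Map[Int, Double], rho: Double, D: Double,
                sharedVar: VarSet) {
  // Step (16): lambda <- (1 - rho) * lambda, an O(1) array multiplication.
  sharedVar.multiply("lambda:" + k, 1.0 - rho)

  // Step (17): add rho * D * lambdaTilde(w) at the non-zero entries only,
  // so the cost is proportional to the sparsity of lambdaTilde.
  for ((w, value) <- lambdaTilde)
    sharedVar.add("lambda:" + k, w, rho * D * value)
}
```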

A.3 Stochastic Collaborative Filtering

Collaborative filtering has wide applications in personalized ranking and recommender systems. Suppose that we have a set U of users and a set I of items. There are historical records showing that each user u ∈ U has chosen a subset of the items. The goal is to predict the user's choices in the future. Collaborative filtering assumes that there is a latent vector v_u ∈ R^d associated with each user and a latent vector v_i associated with each item. The affinity between a user and an item, or the likelihood that the user will choose the item in the future, is measured by the inner product of their latent vectors. There is a successful stochastic algorithm for learning the latent vectors given implicit user feedback, called Bayesian Personalized Ranking (BPR) [22]. Given a user-item pair (u, i) where the user has chosen the item, the algorithm randomly samples another item j that has not been chosen, and updates the latent vectors of u, i, j by

v_u ← (1 − η_u λ) v_u + η_u σ(⟨v_u, v_i − v_j⟩)(v_i − v_j),    (19)
v_i ← (1 − η_i λ) v_i + η_i σ(⟨v_u, v_i − v_j⟩) v_u,    (20)
v_j ← (1 − η_j λ) v_j − η_j σ(⟨v_u, v_i − v_j⟩) v_u,    (21)

where λ is a hyper-parameter of the algorithm. The learning rates η_u, η_i and η_j are associated with the respective variables; they are determined and updated using the AdaGrad algorithm by Duchi et al. [5]. The function σ(·) is defined by σ(t) = 1/(1 + e^t). Intuitively, the updates (19)-(21) attract the latent vector of item i towards the user vector and push the latent vector of item j away from the user vector. For this algorithm, both (v_u, v_i, v_j) and (η_u, η_i, η_j) are shared variables.

Splash implementation Similar to the implementation of stochastic variational inference, the updates (19)-(21) can be implemented via a combination of the add operator and the multiply operator. They can also be implemented by only using the add operator.
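For concreteness, one of the three updates, say (19), could be expressed with a multiply followed by adds, in the same spirit as the SVI example. The multiply and element-wise add signatures are assumed as before, and the key layout is illustrative.

```scala
// Sketch of update (19):
// v_u <- (1 - eta_u * lambda) v_u + eta_u * sigma(<v_u, v_i - v_j>) (v_i - v_j)
def updateUser(u: String, vu: Array[Double], vi: Array[Double], vj: Array[Double],
               etaU: Double, lambda: Double, sharedVar: VarSet) {
  val diff  = (vi, vj).zipped.map(_ - _)
  val score = (vu, diff).zipped.map(_ * _).sum
  val sigma = 1.0 / (1.0 + math.exp(score))           // sigma(t) = 1 / (1 + e^t), as above
  sharedVar.multiply("v:" + u, 1.0 - etaU * lambda)   // assumed multiply operator
  for (d <- diff.indices)
    sharedVar.add("v:" + u, d, etaU * sigma * diff(d)) // assumed element-wise add
}
```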

B Constructing Linear Transformation on a Thread

When element-wise operators are sequentially applied, they merge into a single linear transformation. Assume that after processing a local subset S, the resulting transformation can be represented by

v ← Γ(S) · v + ∆(S) + T(S),

where Γ(S) is the scale factor, ∆(S) is the term resulting from the element-wise add operators, and T(S) is the term resulting from the element-wise delayed add operators declared before the last synchronization. We construct Γ(S), ∆(S) and T(S) incrementally. Let P be the set of processed elements. At the beginning, the set of processed elements is empty, so we initialize

Γ(P) = 1,   ∆(P) = 0   and   T(P) = 0   for P = ∅.

After processing element z, we assume that the user has performed all types of operations, resulting in a transformation of the form

v ← γ(v + t) + δ,    (22)

where the scalars γ and δ result from instant operators and t results from the delayed operators. Concatenating transformation (22) with the transformation constructed on the set P, we have

v ← γ ( Γ(P) · v + ∆(P) + T(P) + t ) + δ = ( γ · Γ(P) ) · v + ( γ · ∆(P) + δ ) + ( γ · T(P) + γt ).

Accordingly, we update the terms Γ, ∆ and T by

Γ(P ∪ {z}) = γ · Γ(P),   ∆(P ∪ {z}) = γ · ∆(P) + δ   and   T(P ∪ {z}) = γ · T(P) + γt,    (23)

and update the set of processed elements to P ∪ {z}. After processing the entire local subset, the set P will be equal to S, so that we obtain Γ(S), ∆(S) and T(S).
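The incremental construction (23) is straightforward to express directly. The following self-contained sketch (our own illustration) merges a per-element step (γ, δ, t) into the running triple (Γ, ∆, T) for a single scalar variable.

```scala
// Running linear transformation v -> Gamma * v + Delta + T for one variable,
// updated per element according to (23).
case class Transform(gamma: Double = 1.0, delta: Double = 0.0, t: Double = 0.0) {
  // Merge one element's operations v <- g * (v + tDelayed) + d, as in (22).
  def step(g: Double, d: Double, tDelayed: Double): Transform =
    Transform(g * gamma, g * delta + d, g * t + g * tDelayed)

  def apply(v: Double): Double = gamma * v + delta + t
}

object TransformDemo extends App {
  // Two processed elements: an add of 0.5, then a multiply by 0.9 together
  // with a delayed add of 0.2 that was declared in an earlier iteration.
  val merged = Transform().step(g = 1.0, d = 0.5, tDelayed = 0.0)
                          .step(g = 0.9, d = 0.0, tDelayed = 0.2)
  println(merged)       // Transform(0.9,0.45,0.18)
  println(merged(2.0))  // 0.9 * 2.0 + 0.45 + 0.18 = 2.43
}
```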

C Generalizing Stochastic Algorithms

In this section, we illustrate how to generalize the collapsed Gibbs sampling method, the stochastic variational inference method and the Bayesian personalized ranking algorithm to processing weighted samples. Suppose that the sample weight is m. The collapsed Gibbs sampling updates (13) and (14) are generalized by

n_{k|d} ← n_{k|d} − m,   n_{w|k} ← n_{w|k} − m,   n_k ← n_k − m,    (24)
n_{k′|d} ← n_{k′|d} + m,   n_{w|k′} ← n_{w|k′} + m,   n_{k′} ← n_{k′} + m.    (25)

Note that the update (24) should be implemented as a delayed operator: it was declared in the last iteration but is executed in the current iteration. For stochastic variational inference, we generalize the update (15) by

γ_{dk} ← γ̃_{dk},   λ_{kw} ← (1 − ρ_{t,m}) λ_{kw} + ρ_{t,m} D λ̃_{kw}   and   t ← t + m,    (26)

where γ̃_{dk}, λ̃_{kw} and ρ_t have the same meanings as in formula (15). The coefficient ρ_{t,m} is defined by

1 − ρ_{t,m} := Π_{i=t}^{t+m−1} (1 − ρ_i).

In fact, formula (26) is the result of repeatedly applying formula (15) m times. For stochastic collaborative filtering, we generalize the BPR updates (19)-(21) by

v_u ← (1 − η̄_u λ) v_u + η̄_u σ(⟨v_u, v_i − v_j⟩)(v_i − v_j),    (27)
v_i ← (1 − η̄_i λ) v_i + η̄_i σ(⟨v_u, v_i − v_j⟩) v_u,    (28)
v_j ← (1 − η̄_j λ) v_j − η̄_j σ(⟨v_u, v_i − v_j⟩) v_u,    (29)

where the learning rates η̄_u, η̄_i and η̄_j are defined by

η̄_u := min(1, m η_u),   η̄_i := min(1, m η_i)   and   η̄_j := min(1, m η_j).

By this definition, the stepsize has been multiplied by the sample weight, but we force the stepsize to be bounded by one. This is because, in stochastic collaborative filtering, a stepsize greater than one easily causes divergence.
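The compound learning rate used in (26) is simply a running product; a small self-contained helper (our own illustration):

```scala
// rho_{t,m} from above: 1 - rho_{t,m} = prod_{i=t}^{t+m-1} (1 - rho_i).
// `rho` is the per-step learning rate schedule, supplied by the caller.
def rhoCompound(t: Int, m: Int, rho: Int => Double): Double =
  1.0 - (t until t + m).map(i => 1.0 - rho(i)).product

// Example: with rho_i = 1 / (i + 1), m = 3 and t = 9,
// rhoCompound(9, 3, i => 1.0 / (i + 1)) == 1 - (9/10)*(10/11)*(11/12) = 0.25.
```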

D Example: Reweighting in Parallelized SGD

We present a simple example illustrating how the strategy described in Section 4.2 benefits parallelization. Consider the following convex optimization problem: there are N = 3,000 two-dimensional vectors, represented by x_1, ..., x_N, such that each x_i is randomly and independently drawn from the normal distribution x ~ N(0, I_{2×2}). The goal is to find a two-dimensional vector w which minimizes the weighted distance to all samples. More precisely, the loss function on sample x_i is defined by

ℓ(w; x_i) := (x_i − w)^T diag(1, 1/100) (x_i − w)

and the overall objective function is L(w) := (1/N) Σ_{i=1}^N ℓ(w; x_i). We want to find the vector that minimizes the objective function L(w).

We use the SGD update (8) to solve the problem. The algorithm is initialized by w_0 = (−1, −1)^T and the stepsize is chosen as η_t = 1/√t. For parallel execution, the dataset is evenly partitioned into m = 30 disjoint subsets, such that each thread accesses a single subset containing a 1/30 fraction of the data. The sequential implementation and the parallel implementations are compared in Figure 3. Specifically, we compare seven types of solutions defined by different strategies:

(a) The exact minimizer of L(w).
(b) The solution of SGD achieved by taking a full pass over the dataset. The dataset contains N = 3,000 samples.
(c) The local solutions of the 30 parallel threads. Each thread runs SGD by taking one pass over its local data. The local dataset contains 100 samples.
(d) The average of the local solutions in (c). This is the averaging scheme described by formula (4).
(e) The aggregate of the local solutions in (c). This is the accumulation scheme described by formula (3).
(f) The local solutions of the 30 parallel threads processing weighted data. Each element is weighted by 30. Each thread runs SGD by taking one pass over its local data.
(g) The combination of the parallel updates by formula (7), setting the sample weight m = 30. Under this setting, formula (7) is equivalent to averaging the local solutions in (f).

[Figure 3: scatter plot of solutions (a)-(g) in the two-dimensional parameter space; the accumulated solution (e) lies far outside the plotted range, near the point (29, 8).]

Figure 3: Comparing parallelization schemes on a simple convex optimization problem. A total of N = 3,000 samples are partitioned into m = 30 batches. Each batch is processed by an independent thread running stochastic gradient descent. Each thread uses either unit-weight data or weighted data (weight = 30). The local solutions are combined by either averaging or accumulation. From the plot, we find that combining weighted solutions achieves the best performance.

In Figure 3, we observe that solution (b) and solution (g) achieve the best performance. Solution (b) is obtained by a sequential implementation of SGD: it is the baseline that parallel algorithms aim to approach. Solution (g) is obtained by Splash with the reweighting scheme. The

solutions obtained by other parallelization schemes, namely solutions (d) and (e), perform poorly. In particular, the averaging scheme (d) has a large bias relative to the optimal solution, and the accumulation scheme (e) diverges far from the optimal solution. To see why Splash is better, we compare the local solutions (c) and (f), which correspond to unweighted SGD and weighted SGD respectively. We find that the solutions in (c) have a significant bias but relatively small variance. In contrast, the solutions in (f) have greater variance but much smaller bias. This verifies our intuition that reweighting helps to reduce the bias by enlarging the local dataset. Note that averaging reduces the variance but doesn't change the bias, which explains why averaging works better with reweighting.

E Determining Thread Number

Suppose that there are M available cores in the cluster. The execution engine partitions these cores into several groups. Suppose that the i-th group contains m_i cores. The group sizes are determined by the following allocation scheme:

• Let 4m_0 be the thread number adopted by the last iteration. Let 4m_0 := 1 at the first iteration.

• For i = 1, 2, ..., if 8m_{i−1} ≤ M − Σ_{j=1}^{i−1} m_j, then let m_i := 4m_{i−1}. Otherwise, let m_i := M − Σ_{j=1}^{i−1} m_j. Terminate when Σ_{j=1}^{i} m_j = M.

It can be easily verified that the candidate thread numbers (which are the group sizes) in the current iteration are at least as large as those of the last iteration. The candidate thread numbers are 4m_0, 16m_0, ..., until they consume all of the available cores. The i-th group is randomly allocated m_i Parametrized RDD partitions for training, and another m_i Parametrized RDD partitions for testing. In the training phase, the groups execute the algorithm on m_i parallel threads, following the parallelization strategy described in Section 4.2. In the testing phase, the training results are broadcast to all the partitions, and the thread number associated with the smallest testing loss is chosen. The user is asked to provide an evaluation function ℓ : W × S → R which maps a variable-sample pair to a loss value. This function can, for example, be chosen as the element-wise loss for optimization problems, or the negative log-likelihood for probabilistic models. If the user doesn't specify an evaluation function, then the largest m_i is chosen by the system. Once a thread number is chosen, its training result is applied to all Parametrized RDD partitions.

The allocation scheme ensures that the largest thread number is at least 3/4 of M. Thus, in case M is the best degree of parallelism, the computation power will not be badly wasted. The allocation scheme also ensures that M is the only candidate degree of parallelism if the last iteration's thread number is greater than M/2. Thus, the degree of parallelism quickly converges to M if it outperforms other degrees. Finally, the thread number is not updated in every iteration. If the same thread number has been chosen by multiple consecutive tests, then the system will continue using it for a long time, until some retesting criterion is satisfied.
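A self-contained transcription of the bullet scheme above (our own sketch, not the Splash implementation). Writing the condition in terms of the candidate c = 4m_{i−1}, a new candidate is accepted while at least 2c unallocated cores remain, and the last group takes whatever is left.

```scala
// M is the number of available cores; prev is the thread number adopted in
// the last iteration (1 at the first iteration), i.e. 4*m0.
def candidateGroupSizes(M: Int, prev: Int): List[Int] = {
  def build(candidate: Int, remaining: Int, acc: List[Int]): List[Int] =
    if (remaining <= 0) acc.reverse
    else if (2 * candidate <= remaining)
      build(4 * candidate, remaining - candidate, candidate :: acc)
    else (remaining :: acc).reverse
  build(prev, M, Nil)
}

// For instance, candidateGroupSizes(64, 1) == List(1, 4, 16, 43), while
// candidateGroupSizes(64, 33) == List(64): once the previous thread number
// exceeds M/2, M is the only candidate, matching the discussion above.
```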


F Proof of Theorem 1

We assume that ‖∇ℓ(w; x)‖_2 ≤ G for any (w, x) ∈ W × S. The theorem will be established if the following inequality holds:

Σ_{k=1}^T Σ_{i=1}^m Σ_{j=1}^n 2η^k_{i,j} E[L(w^k_{i,j}) − L(w*)] ≤ m E[‖w^0 − w*‖_2^2 − ‖w^T − w*‖_2^2] + G^2 Σ_{k=1}^T Σ_{i=1}^m Σ_{j=1}^n (η^k_{i,j})^2.    (30)

To see how inequality (30) proves the theorem, notice that the convexity of the function L yields

E[L(w̄^T) − L(w*)] ≤ ( Σ_{k=1}^T Σ_{i=1}^m Σ_{j=1}^n η^k_{i,j} E[L(w^k_{i,j}) − L(w*)] ) / ( Σ_{k=1}^T Σ_{i=1}^m Σ_{j=1}^n η^k_{i,j} ).

Thus, inequality (30) implies

E[L(w̄^T) − L(w*)] ≤ ( m E[‖w^0 − w*‖_2^2 − ‖w^T − w*‖_2^2] + G^2 Σ_{k=1}^T Σ_{i=1}^m Σ_{j=1}^n (η^k_{i,j})^2 ) / ( 2 Σ_{k=1}^T Σ_{i=1}^m Σ_{j=1}^n η^k_{i,j} ).

By the assumptions on η_t, it is easy to see that the numerator of the right-hand side is bounded, but the denominator is unbounded. Thus, the fraction converges to zero as T → ∞.

It remains to prove inequality (30). We prove it by induction. The inequality trivially holds for T = 0. For any integer k > 0, we assume that the inequality holds for T = k − 1. At iteration k, every thread starts from the shared vector w^{k−1}, so that w^k_{i,1} ≡ w^{k−1}. For any j ∈ {1, ..., n}, let g^k_{i,j} be a shorthand for ∇ℓ(w^k_{i,j}; x). A bit of algebraic transformation yields

‖w^k_{i,j+1} − w*‖_2^2 = ‖Π_W(w^k_{i,j} − η^k_{i,j} g^k_{i,j}) − w*‖_2^2 ≤ ‖w^k_{i,j} − η^k_{i,j} g^k_{i,j} − w*‖_2^2
  = ‖w^k_{i,j} − w*‖_2^2 + (η^k_{i,j})^2 ‖g^k_{i,j}‖_2^2 − 2η^k_{i,j} ⟨w^k_{i,j} − w*, g^k_{i,j}⟩,

where the inequality holds since w* ∈ W and Π_W is the projection onto W. Taking expectations on both sides of the inequality and using the assumption that ‖g^k_{i,j}‖_2 ≤ G, we have

E[‖w^k_{i,j+1} − w*‖_2^2] ≤ E[‖w^k_{i,j} − w*‖_2^2] + G^2 (η^k_{i,j})^2 − 2η^k_{i,j} E[⟨w^k_{i,j} − w*, ∇L(w^k_{i,j})⟩].

By the convexity of the function L, we have ⟨w^k_{i,j} − w*, ∇L(w^k_{i,j})⟩ ≥ L(w^k_{i,j}) − L(w*). Plugging in this inequality, we have

2η^k_{i,j} E[L(w^k_{i,j}) − L(w*)] ≤ E[‖w^k_{i,j} − w*‖_2^2] − E[‖w^k_{i,j+1} − w*‖_2^2] + G^2 (η^k_{i,j})^2.    (31)

Summing up inequality (31) for i = 1, ..., m and j = 1, ..., n, we obtain

Σ_{i=1}^m Σ_{j=1}^n 2η^k_{i,j} E[L(w^k_{i,j}) − L(w*)] ≤ m E[‖w^{k−1} − w*‖_2^2] − Σ_{i=1}^m E[‖w^k_{i,n+1} − w*‖_2^2] + Σ_{i=1}^m Σ_{j=1}^n G^2 (η^k_{i,j})^2.    (32)

Notice that w^k = (1/m) Σ_{i=1}^m w^k_{i,n+1}. Thus, Jensen's inequality implies Σ_{i=1}^m ‖w^k_{i,n+1} − w*‖_2^2 ≥ m ‖w^k − w*‖_2^2. Plugging this inequality into upper bound (32) yields

Σ_{i=1}^m Σ_{j=1}^n 2η^k_{i,j} E[L(w^k_{i,j}) − L(w*)] ≤ m E[‖w^{k−1} − w*‖_2^2 − ‖w^k − w*‖_2^2] + Σ_{i=1}^m Σ_{j=1}^n G^2 (η^k_{i,j})^2.    (33)

The induction is complete by combining upper bound (33) with the inductive hypothesis.

G Proof of Theorem 2

Recall that w^k is the value of vector w after iteration k. Let w^k_i be the output of thread i at the end of iteration k. According to the update formula, we have w^k = Π_B((1/m) Σ_{i=1}^m w^k_i), where Π_B(·) is the projector onto the set B. The set B contains the optimal solution w*. Since projecting onto a convex set doesn't increase the distance to any element of the set, and because the w^k_i (i = 1, ..., m) are mutually independent conditioning on w^{k−1}, we have

E[‖w^k − w*‖_2^2] ≤ E[ E[ ‖(1/m) Σ_{i=1}^m w^k_i − w*‖_2^2 | w^{k−1} ] ]
  = (1/m^2) Σ_{i=1}^m E[ E[‖w^k_i − w*‖_2^2 | w^{k−1}] ] + (1/m^2) Σ_{i≠j} E[ E[⟨w^k_i − w*, w^k_j − w*⟩ | w^{k−1}] ]
  = (1/m) E[‖w^k_1 − w*‖_2^2] + ((m − 1)/m) E[‖E[w^k_1 | w^{k−1}] − w*‖_2^2].    (34)

Equation (34) implies that we can upper bound the two terms on the right-hand side separately. To this end, we introduce three shorthand notations:

a_k := E[‖w^k − w*‖_2^2],   b_k := E[‖w^k_1 − w*‖_2^2],   c_k := E[‖E[w^k_1 | w^{k−1}] − w*‖_2^2].

Essentially, equation (34) implies a_k ≤ (1/m) b_k + ((m − 1)/m) c_k. Let a_0 := ‖w^0 − w*‖_2^2, where w^0 is the initial vector. The following two lemmas upper bound b_{k+1} and c_{k+1} as functions of a_k. We defer their proofs to the end of this section.

Lemma 1. For any integer k ≥ 0, we have

b_{k+1} ≤ (k^2/(k + 1)^2) a_k + β_1/((k + 1)^2 n),   where β_1 := 4G^2/λ^2.

Lemma 2. We have c_1 ≤ β_2^2/n^2 and, for any integer k ≥ 1,

c_{k+1} ≤ (k^2/(k + 1)^2) a_k + (2β_2 √a_k + β_2^2/n)/((k + 1)^2 n),   where β_2 := max{ ⌈2H/λ⌉ R, 8G^2 (L + G/ρ^2)/λ^3 }.

Combining equation (34) with the results of Lemma 1 and Lemma 2, we obtain an upper bound on a_1:

a_1 ≤ β_1/(mn) + β_2^2/n^2 =: β_3.    (35)

Furthermore, Lemma 1 and Lemma 2 upper bound a_{k+1} as a function of a_k:

a_{k+1} ≤ (k^2/(k + 1)^2) a_k + (β_3 + 2β_2 √a_k / n)/(k + 1)^2.    (36)

Using upper bounds (35) and (36), we claim that

a_k ≤ (β_3 + 2β_2 √β_3 / n)/k   for k = 1, 2, ...    (37)

By inequality (35), the claim is true for k = 1. We assume that the claim holds for k and prove it for k + 1. Using the inductive hypothesis, we have a_k ≤ β_3. Thus, inequality (36) implies

a_{k+1} ≤ (k^2/(k + 1)^2) · (β_3 + 2β_2 √β_3 / n)/k + (β_3 + 2β_2 √β_3 / n)/(k + 1)^2 = (β_3 + 2β_2 √β_3 / n)/(k + 1),

which completes the induction. Note that both β_1 and β_2 are constants that are independent of k, m and n. Plugging in the definition of β_3, we can rewrite inequality (37) as

a_k ≤ 4G^2/(λ^2 kmn) + C_1/(k m^{1/2} n^{3/2}) + C_2/(k n^2),

where C_1 and C_2 are constants that are independent of k, m and n. This completes the proof of the theorem.

G.1 Proof of Lemma 1

In this proof, we use wj as a shorthand to denote the value of vector w at iteration k + 1 when the first thread is processing the j-th element. We drop the notation’s dependence on the iteration number and on the thread index since they are explicit from the context. Let gj = ∇ℓ(wj ; xj ) be the gradient of loss function ℓ with respect to wj on the j-th element. Let ηj be the stepsize 2 . parameter when wj is updated. It is easy to verify that ηj = λ(kn+j) We start by upper bounding the expectation of kw1k+1 − w∗ k22 conditioning on wk . By the strong convexity of L and the fact that w∗ minimizes L, we have hE[gj ], wj − w∗ i ≥ L(wj ) − L(w∗ ) +

λ kwj − w∗ k22 . 2

as well as L(wj ) − L(w∗ ) ≥

λ kwj − w∗ k22 . 2

Hence, we have hE[gj ], wj − w∗ i ≥ λ kwj − w∗ k22

(38)

Recall that $\Pi_W(\cdot)$ denotes the projection onto the set $W$. By the convexity of $W$, we have $\|\Pi_W(u) - v\|_2 \le \|u - v\|_2$ for any $u$ and any $v \in W$. Using these inequalities, we have the following:
\begin{align*}
E[\|w_{j+1} - w^*\|_2^2 \mid w^k]
&= E[\|\Pi_W(w_j - \eta_j g_j) - w^*\|_2^2 \mid w^k]
\le E[\|w_j - \eta_j g_j - w^*\|_2^2 \mid w^k] \\
&= E[\|w_j - w^*\|_2^2 \mid w^k] - 2\eta_j E[\langle g_j,\, w_j - w^*\rangle \mid w^k] + \eta_j^2 E[\|g_j\|_2^2 \mid w^k].
\end{align*}

Note that the gradient $g_j$ is independent of $w_j$ conditioning on $w^{k-1}$. Thus, we have
\[
E[\langle g_j,\, w_j - w^*\rangle \mid w^k] = E[\langle E[g_j],\, w_j - w^*\rangle \mid w^k] \ge \lambda\, E[\|w_j - w^*\|_2^2 \mid w^k],
\]
where the last inequality follows from inequality (38). As a consequence, we have
\[
E[\|w_{j+1} - w^*\|_2^2 \mid w^k] \le (1 - 2\eta_j\lambda)\, E[\|w_j - w^*\|_2^2 \mid w^k] + \eta_j^2 G^2.
\]
Plugging in $\eta_j = \frac{2}{\lambda(kn+j)}$, we obtain
\[
E[\|w_{j+1} - w^*\|_2^2 \mid w^k] \le \Big(1 - \frac{4}{kn+j}\Big) E[\|w_j - w^*\|_2^2 \mid w^k] + \frac{4G^2}{\lambda^2(kn+j)^2}. \tag{39}
\]

Case $k = 0$: We claim that for any $j \ge 1$,
\[
E[\|w_j - w^*\|_2^2] \le \frac{4G^2}{\lambda^2 j}. \tag{40}
\]
Since $w_1^1 = w_{n+1}$, the claim establishes the lemma. We prove the claim by induction. The claim holds for $j = 1$ because inequality (38) yields
\[
\|w_1 - w^*\|_2^2 \le \frac{\langle E[g_1],\, w_1 - w^*\rangle}{\lambda} \le \frac{G\|w_1 - w^*\|_2}{\lambda}
\qquad\Longrightarrow\qquad
\|w_1 - w^*\|_2 \le G/\lambda.
\]
Otherwise, we assume that the claim holds for $j$. Then inequality (39) yields
\[
E[\|w_{j+1} - w^*\|_2^2] \le \Big(1 - \frac{4}{j}\Big)\frac{4G^2}{\lambda^2 j} + \frac{4G^2}{\lambda^2 j^2}
= \frac{4G^2}{\lambda^2}\cdot\frac{j - 4 + 1}{j^2} \le \frac{4G^2}{\lambda^2(j+1)},
\]
which completes the induction.

Case $k > 0$: We claim that for any $j \ge 1$,
\[
E[\|w_j - w^*\|_2^2 \mid w^k] \le \frac{1}{(kn+j-1)^2}\Big((kn)^2\|w^k - w^*\|_2^2 + \frac{4G^2(j-1)}{\lambda^2}\Big). \tag{41}
\]
We prove (41) by induction. The claim is obviously true for $j = 1$. Otherwise, we assume that the claim holds for $j$ and prove it for $j+1$. Since $1 - \frac{4}{kn+j} \le \big(\frac{kn+j-1}{kn+j}\big)^2$, combining the inductive hypothesis and inequality (39), we have
\begin{align*}
E[\|w_{j+1} - w^*\|_2^2 \mid w^k]
&\le \frac{1}{(kn+j)^2}\Big((kn)^2\|w^k - w^*\|_2^2 + \frac{4G^2(j-1)}{\lambda^2}\Big) + \frac{4G^2}{\lambda^2(kn+j)^2} \\
&= \frac{1}{(kn+j)^2}\Big((kn)^2\|w^k - w^*\|_2^2 + \frac{4G^2 j}{\lambda^2}\Big),
\end{align*}
which completes the induction. Note that claim (41) establishes the lemma since $w_1^{k+1} = w_{n+1}$.

G.2    Proof of Lemma 2

In this proof, we use $w_j$ as a shorthand to denote the value of the vector $w$ at iteration $k+1$ when the first thread is processing the $j$-th element. We drop the notation's dependence on the iteration number and on the thread index since they are explicit from the context. Let $g_j = \nabla\ell(w_j; x_j)$ be the gradient of the loss function $\ell$ with respect to $w_j$ on the $j$-th element. Let $\eta_j$ be the stepsize parameter when $w_j$ is updated. It is easy to verify that $\eta_j = \frac{2}{\lambda(kn+j)}$.

Recall the neighborhood $U_\rho \subset W$ in Assumption A, and note that
\[
w_{j+1} - w^* = \Pi_W(w_j - \eta_j g_j) - w^* = w_j - \eta_j g_j - w^* + I(w_{j+1}\notin U_\rho)\,\big(\Pi_W(w_j - \eta_j g_j) - (w_j - \eta_j g_j)\big),
\]
since when $w \in U_\rho$ we have $\Pi_W(w) = w$. Consequently, an application of the triangle inequality and Jensen's inequality gives
\[
\|E[w_{j+1} - w^* \mid w^k]\|_2 \le \|E[w_j - \eta_j g_j - w^* \mid w^k]\|_2 + E\big[\|(\Pi_W(w_j - \eta_j g_j) - (w_j - \eta_j g_j))\, I(w_{j+1}\notin U_\rho)\|_2 \,\big|\, w^k\big].
\]

By the definition of the projection and the fact that $w_j \in W$, we additionally have
\[
\|\Pi_W(w_j - \eta_j g_j) - (w_j - \eta_j g_j)\|_2 \le \|w_j - (w_j - \eta_j g_j)\|_2 \le \eta_j\|g_j\|_2.
\]
Thus, by combining the above two inequalities and applying Assumption B, we have
\begin{align}
\|E[w_{j+1} - w^* \mid w^k]\|_2
&\le \|E[w_j - \eta_j g_j - w^* \mid w^k]\|_2 + \eta_j E[\|g_j\|_2\, I(w_{j+1}\notin U_\rho) \mid w^k] \notag\\
&\le \|E[w_j - \eta_j g_j - w^* \mid w^k]\|_2 + \eta_j G\cdot P(w_{j+1}\notin U_\rho \mid w^k) \notag\\
&\le \|E[w_j - \eta_j g_j - w^* \mid w^k]\|_2 + \eta_j G\cdot\frac{E[\|w_{j+1} - w^*\|_2^2 \mid w^k]}{\rho^2}, \tag{42}
\end{align}
where the last inequality follows from Markov's inequality.

Now we turn to controlling the rate at which $E[w_j - \eta_j g_j - w^* \mid w^k]$ goes to zero. Let $\ell_j(\cdot) = \ell(\cdot\,; x_j)$ be a shorthand for the loss evaluated on the $j$-th data element. By defining $r_j := g_j - \nabla\ell_j(w^*) - \nabla^2\ell_j(w^*)(w_j - w^*)$, a bit of algebra yields
\[
g_j = \nabla\ell_j(w^*) + \nabla^2\ell_j(w^*)(w_j - w^*) + r_j.
\]

First, we note that $E[\nabla\ell_j(w^*) \mid w^k] = \nabla L(w^*) = 0$. Second, the Hessian $\nabla^2\ell_j(w^*)$ is independent of $w_j$. Hence we have
\[
E[g_j \mid w^k] = E[\nabla\ell_j(w^*)] + E[\nabla^2\ell_j(w^*) \mid w^k]\cdot E[w_j - w^* \mid w^k] + E[r_j \mid w^k]
= \nabla^2 L(w^*)\, E[w_j - w^* \mid w^k] + E[r_j \mid w^k]. \tag{43}
\]
Taylor's theorem implies that $r_j$ is the Lagrange remainder
\[
r_j = \big(\nabla^2\ell_j(w') - \nabla^2\ell_j(w^*)\big)(w' - w^*),
\]
where $w' = \alpha w_j + (1-\alpha)w^*$ for some $\alpha \in [0,1]$. Applying Assumption B, we find that
\[
E[\|r_j\|_2 \mid w^k] \le E\big[\|\nabla^2\ell_j(w') - \nabla^2\ell_j(w^*)\|_2\, \|w_j - w^*\|_2 \,\big|\, w^k\big] \le L\, E[\|w_j - w^*\|_2^2 \mid w^k]. \tag{44}
\]

By combining the expansion (43) with the bound (44), we find that
\[
\|E[w_j - \eta_j g_j - w^* \mid w^k]\|_2
= \big\|(I - \eta_j\nabla^2 L(w^*))\, E[w_j - w^* \mid w^k] - \eta_j E[r_j \mid w^k]\big\|_2
\le \|(I - \eta_j\nabla^2 L(w^*))\, E[w_j - w^* \mid w^k]\|_2 + \eta_j L\, E[\|w_j - w^*\|_2^2 \mid w^k].
\]
Using the earlier bound (42) and plugging in the assignment $\eta_j = \frac{2}{\lambda(kn+j)}$, this inequality then yields
\[
\|E[w_{j+1} - w^* \mid w^k]\|_2 \le \big\|I - \eta_j\nabla^2 L(w^*)\big\|_2\, \|E[w_j - w^* \mid w^k]\|_2
+ \frac{2}{\lambda(kn+j)}\left( L\, E[\|w_j - w^*\|_2^2 \mid w^k] + \frac{G\, E[\|w_{j+1} - w^*\|_2^2 \mid w^k]}{\rho^2}\right). \tag{45}
\]

Next, we split the proof into two cases: $k = 0$ and $k > 0$.

Case $k = 0$: Note that by strong convexity and our condition that $\|\nabla^2 L(w^*)\|_2 \le H$, whenever $\eta_j H \le 1$ we have
\[
\|I - \eta_j\nabla^2 L(w^*)\|_2 = 1 - \eta_j\lambda_{\min}(\nabla^2 L(w^*)) \le 1 - \eta_j\lambda.
\]
Define $\tau_0 = \lceil 2H/\lambda\rceil$; then for $j \ge \tau_0$, we have $\eta_j H \le 1$. As a consequence, inequality (40) (in the proof of Lemma 1) and inequality (45) yield that for any $j \ge \tau_0$,
\[
\|E[w_{j+1} - w^*]\|_2 \le (1 - 2/j)\,\|E[w_j - w^*]\|_2 + \frac{8G^2(L + G/\rho^2)}{\lambda^3 j^2}. \tag{46}
\]
As shorthand notations, we define two intermediate variables
\[
u_j := \|E[w_j - w^*]\|_2 \qquad\text{and}\qquad b_1 := \frac{8G^2(L + G/\rho^2)}{\lambda^3}.
\]
Inequality (46) then implies the inductive relation $u_{j+1} \le (1 - 2/j)u_j + b_1/j^2$ for any $j \ge \tau_0$.

Now we claim that, by defining $b_2 := \max\{\tau_0 R, b_1\}$, we have $u_j \le b_2/j$. Indeed, it is clear that $u_j \le \tau_0 R/j$ for $j = 1, 2, \dots, \tau_0$. For $j \ge \tau_0$, using the inductive hypothesis, we have
\[
u_{j+1} \le \frac{(1 - 2/j)\,b_2}{j} + \frac{b_1}{j^2} \le \frac{b_2 j - 2b_2 + b_2}{j^2} = \frac{b_2(j-1)}{j^2} \le \frac{b_2}{j+1}.
\]
This completes the induction and establishes the lemma for $k = 0$.

Case $k > 0$: Let $u_j := \|E[w_j - w^* \mid w^k]\|_2$ and $\delta := \|w^k - w^*\|_2$ as shorthands. Combining inequality (41) (in the proof of Lemma 1) and inequality (45) yields
\[
u_{j+1} \le \Big(1 - \frac{2}{kn+j}\Big) u_j + \frac{2(L + G/\rho^2)}{\lambda(kn+j)(kn+j-1)^2}\Big((kn)^2\delta^2 + \frac{4G^2 j}{\lambda^2}\Big)
\le \frac{kn+j-2}{kn+j}\, u_j + \frac{b_1 kn\delta^2 + b_2/k}{(kn+j-1)(kn+j)}, \tag{47}
\]
where we have introduced the shorthand notations $b_1 := \frac{2(L + G/\rho^2)}{\lambda}$ and $b_2 := \frac{8G^2(L + G/\rho^2)}{\lambda^3}$. With these notations, we claim that
\[
u_j \le \frac{(kn-1)kn\delta + (j-1)(b_1 kn\delta^2 + b_2/k)}{(kn+j-2)(kn+j-1)}. \tag{48}
\]
We prove the claim by induction. Indeed, since $u_1 = \delta$, the claim obviously holds for $j = 1$. Otherwise, we assume that the claim holds for $j$; then inequality (47) yields
\[
u_{j+1} \le \frac{(kn-1)kn\delta + (j-1)(b_1 kn\delta^2 + b_2/k)}{(kn+j-1)(kn+j)} + \frac{b_1 kn\delta^2 + b_2/k}{(kn+j-1)(kn+j)}
= \frac{(kn-1)kn\delta + j(b_1 kn\delta^2 + b_2/k)}{(kn+j-1)(kn+j)},
\]
which completes the induction. As a consequence, a bit of algebraic transformation yields
\begin{align}
\|E[w_1^{k+1} - w^* \mid w^k]\|_2 = u_{n+1}
&\le \frac{(kn-1)kn\delta + n(b_1 kn\delta^2 + b_2/k)}{((k+1)n-1)(k+1)n} \notag\\
&\le \frac{k^2 n^2\delta}{(k+1)^2 n^2} + \frac{n\, b_1 kn\delta^2}{kn(k+1)n} + \frac{n\, b_2/k}{kn(k+1)n} \notag\\
&\le \Big(\frac{k}{k+1}\Big)^2\delta + \frac{b_1\delta^2}{k+1} + \frac{b_2}{k(k+1)n} \notag\\
&= \frac{k}{k+1}\left(\frac{k\delta + \frac{k+1}{k}\, b_1\delta^2}{k+1} + \frac{b_2}{k^2 n}\right). \tag{49}
\end{align}

By the fact that $w^k \in B$, we have $\frac{k+1}{k}\, b_1\delta \le \frac{k+1}{k}\, b_1 D \le 1$. Thus, inequality (49) implies
\[
\|E[w_1^{k+1} - w^* \mid w^k]\|_2^2 \le \Big(\frac{k}{k+1}\Big)^2\Big(\delta + \frac{b_2}{k^2 n}\Big)^2.
\]
Taking expectation on both sides of the inequality and then applying Jensen's inequality, we obtain
\[
E\big[\|E[w_1^{k+1} - w^* \mid w^k]\|_2^2\big] \le \frac{k^2\, E[\delta^2]}{(k+1)^2} + \frac{2b_2\sqrt{E[\delta^2]} + b_2^2/n}{(k+1)^2 n}.
\]
Hence, the lemma is established.

Dataset     Number of Samples   Number of Features
Covtype     581,012             54
RCV1        677,399             47,236
MNIST 8M    1,617,570           784

Table 1: Datasets for logistic regression. These datasets are obtained from the LIBSVM Data website [4].

Figure 4: The optimality gap $\ell(w) - \ell(w^*)$ as a function of the running time for logistic regression. The logistic regression model is trained by Gradient Descent (GD), Stochastic Gradient Descent (SGD) and its parallel version implemented by Splash. (a) Covtype dataset; (b) RCV1 dataset; (c) MNIST dataset (3 versus 8).

H    More Experiments

In this section, we report more details on the experiments.

H.1    Stochastic Gradient Descent

We solve binary classification problems using logistic regression. The goal is to minimize the objective function
\[
L(w) = \frac{1}{n}\sum_{i=1}^n \log\big(1 + \exp(-y_i\langle w, x_i\rangle)\big) + \frac{\lambda}{2}\|w\|_2^2,
\]
where $x_i \in \mathbb{R}^d$ is the feature vector of the $i$-th element and $y_i \in \{-1, 1\}$ is its binary label. We choose the regularization coefficient $\lambda = 10^{-6}$. The datasets for this task are summarized in Table 1. Among the three datasets, Covtype and MNIST have dense features; the RCV1 dataset has sparse features. For the MNIST dataset, we extract digits "3" and "8" and solve the binary classification problem of distinguishing these two digits.

We employ the SGD algorithm in formula (8). In each step, a data entry $(x_i, y_i)$ is fed into the processing function. The algorithm is initialized at the origin and the stepsize is chosen as $\eta_t = \eta/\sqrt{t}$. We manually tune the learning rate $\eta$ to optimize the single-thread SGD performance. In particular, we set $\eta = 20$ for the Covtype dataset and $\eta = 100$ for the RCV1 and MNIST 8M datasets. Between two rounds of communication, the execution engine takes a full pass over the MNIST dataset. Since the Covtype and RCV1 datasets are relatively small, we let the execution engine take 16 passes over Covtype and 8 passes over RCV1 instead.
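For concreteness, the following is a minimal single-thread sketch of this baseline: SGD on the regularized logistic loss with stepsize $\eta_t = \eta/\sqrt{t}$, initialized at the origin. The function name and the dense NumPy data layout are illustrative assumptions; this is not the Splash processing function of formula (8) itself.

```python
import numpy as np

def logistic_sgd(X, y, lam=1e-6, eta0=20.0, passes=1):
    """Single-thread SGD for L2-regularized logistic regression (a sketch).

    Minimizes (1/n) * sum_i log(1 + exp(-y_i <w, x_i>)) + (lam/2) * ||w||^2
    with stepsize eta_t = eta0 / sqrt(t), starting from the origin.
    """
    n, d = X.shape
    w = np.zeros(d)                              # initialize at the origin
    t = 0
    for _ in range(passes):
        for i in np.random.permutation(n):
            t += 1
            eta = eta0 / np.sqrt(t)              # eta_t = eta / sqrt(t)
            margin = y[i] * X[i].dot(w)
            # gradient of the regularized logistic loss at (x_i, y_i)
            grad = -y[i] * X[i] / (1.0 + np.exp(margin)) + lam * w
            w -= eta * grad
    return w
```

On the sparse RCV1 dataset one would store X in a sparse matrix format rather than the dense array assumed here.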

We compare Splash with the standard single-thread SGD and the 64-thread gradient descent. Figure 4 plots the convergence curves of the three algorithms. The convergence is measured by the optimality gap $\ell(w) - \ell(w^*)$, where $w$ is the output of the algorithm and $w^*$ is the vector that minimizes the objective function. We observe that the stochastic algorithms have superior performance over batch gradient descent: even the single-thread SGD is much faster than the 64-thread gradient descent, which highlights the benefit of employing stochastic algorithms for solving convex optimization problems. Furthermore, the Splash implementation significantly outperforms the single-thread SGD. We find that Splash works well on both the low-dimensional datasets (Covtype and MNIST) and the high-dimensional dataset (RCV1). Table 2 lists the running time of single-thread SGD and Splash for achieving the same loss function value; these loss values are achieved by the single-thread SGD taking 30 passes over the dataset. On the Covtype, RCV1 and MNIST datasets, Splash is 16x, 37x and 28x faster than the single-thread SGD.

Dataset    Loss Value   SGD Time (sec)   Splash Time (sec)
Covtype    0.5150       78               5
RCV1       0.0602       450              12
MNIST      0.1617       2,328            84

Table 2: Comparing the running time of single-thread SGD and Splash for achieving the same loss function value. These loss values are achieved by the single-thread SGD taking 30 passes over the dataset.

Dataset    Number of Words   Number of Documents   Vocabulary Size
NIPS       1,900,000         1,500                 12,419
Enron      6,400,000         39,861                28,102
NYTimes    99,542,125        300,000               102,660

Table 3: Datasets for Latent Dirichlet Allocation. These datasets are obtained from the UCI machine learning repository [15].

H.2    Collapsed Gibbs Sampling

We now train the Latent Dirichlet Allocation (LDA) model using collapsed Gibbs sampling (24)–(25). We conduct experiments on three datasets: the smaller NIPS paper and Enron email datasets, and the larger New York Times article dataset. Detailed descriptions of the datasets are summarized in Table 3. The number of topics is set to $K = 20$. The hyper-parameters are chosen as $\alpha = 50/K$ and $\beta = 0.01$. We randomly partition the words in each document, so that half of them are used for training and the remainder for testing. Before the program starts, the word topics are randomly initialized. For all datasets, Splash takes one pass over the dataset between two rounds of communication.

The learning accuracy is measured by the perplexity on the testing set. Let $p(w_d)$ be the probability of the testing words in document $d$ under the LDA model; then the perplexity is equal to
\[
\text{perplexity} = \exp\Big(-\frac{\sum_{d=1}^D \log p(w_d)}{\sum_{d=1}^D N_d}\Big),
\]
where $N_d$ is the number of testing words in document $d$. A smaller perplexity score indicates a better predictive performance of the model.
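For reference, the sketch below spells out one sweep of textbook collapsed Gibbs sampling for LDA in plain Python. It is an assumption that the sampling formulas (24)–(25) take this standard form; the function name, count arrays and data layout are illustrative and are not part of the Splash interface.

```python
import numpy as np

def collapsed_gibbs_sweep(docs, z, K, V, alpha, beta, n_dk, n_kw, n_k):
    """One sweep of (standard) collapsed Gibbs sampling for LDA.

    docs : list of lists of word ids; z : matching topic assignments.
    n_dk : D x K document-topic counts, n_kw : K x V topic-word counts,
    n_k  : length-K topic counts. The counts must be consistent with z.
    """
    for d, words in enumerate(docs):
        for i, w in enumerate(words):
            k_old = z[d][i]
            # remove the current assignment from the counts
            n_dk[d, k_old] -= 1
            n_kw[k_old, w] -= 1
            n_k[k_old] -= 1
            # conditional distribution of the token's topic given all others
            p = (n_dk[d] + alpha) * (n_kw[:, w] + beta) / (n_k + V * beta)
            k_new = np.random.choice(K, p=p / p.sum())
            # record the new assignment and restore the counts
            z[d][i] = k_new
            n_dk[d, k_new] += 1
            n_kw[k_new, w] += 1
            n_k[k_new] += 1
```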

Figure 5: The perplexity score as a function of the running time for the LDA model. The model is trained by collapsed Gibbs sampling and its parallel version implemented by Splash. (a) NIPS dataset; (b) Enron dataset; (c) NYTimes dataset.

Figure 5 shows that the perplexity score of Splash converges faster than that of the single-thread Gibbs sampling. In Table 4, we compare the running time of single-thread Gibbs sampling and its parallel version implemented by Splash to achieve the same perplexity score. Splash is 38x, 149x and 30x faster than the single-thread algorithm on the NIPS, Enron and NYTimes datasets, respectively.

Dataset    Perplexity   Gibbs Time (sec)   Splash Time (sec)
NIPS       1,825        2,227              59
Enron      3,119        9,392              63
NYTimes    6,627        26,022             875

Table 4: Comparing the running time of single-thread collapsed Gibbs sampling (Gibbs) and its parallel version implemented by Splash for achieving the same perplexity score. These scores are achieved by collapsed Gibbs sampling taking 100 passes over the NIPS and Enron datasets, or taking 15 passes over the NYTimes dataset.
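The perplexity reported in Figure 5 and Table 4 follows directly from the formula above. A minimal sketch, where log_p_docs holds $\log p(w_d)$ for each test document and n_test_words holds $N_d$ (both names are ours):

```python
import numpy as np

def perplexity(log_p_docs, n_test_words):
    """perplexity = exp( - sum_d log p(w_d) / sum_d N_d )"""
    return float(np.exp(-np.sum(log_p_docs) / np.sum(n_test_words)))
```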

H.3    Stochastic Variational Inference

Variational inference [2] and stochastic variational inference [10] are alternative efficient approaches to learning the LDA model. In this experiment, we compare the parallel variational inference (VI) algorithm, the single-thread stochastic variational inference (SVI) algorithm, and the parallel SVI implemented by Splash. Note that the VI algorithm is easily parallelizable: the dataset is partitioned into $m$ subsets according to the document indices, each thread updates the document-topic parameter $\gamma_{dk}$ for documents in its own subset, and then all threads perform a reduce operation to compute the new topic-word parameter $\lambda_{kw}$. Since VI is a batch algorithm, there is no conflict in the parallelization.

We test the three algorithms on the three datasets described in Table 3. We choose the topic number $K = 20$ and the same hyper-parameters $\alpha, \beta$ as in collapsed Gibbs sampling. The learning rate is chosen as $\rho_t = (t+1)^{-0.7}$, which optimizes the performance of the single-thread algorithm. Following the paper by Hoffman et al. [10], we organize documents into mini-batches, so that SVI processes a mini-batch as if it were a single document; mini-batch training has been shown to improve SVI performance. On the NIPS, Enron and NYTimes datasets, the mini-batch size is set to 8, 24 and 128, respectively.

We let Splash synchronize once after processing a single mini-batch. To evaluate the algorithms' performance, we use the predictive log-likelihood metric employed by Hoffman et al. [10] for evaluating the SVI algorithm. In particular, we partition the dataset evenly into a training set $S$ and a test set $T$. For each test document in $T$, we partition its words into a set of observed words $w_{\mathrm{obs}}$ and held-out words $w_{\mathrm{ho}}$, keeping the sets of unique words in $w_{\mathrm{obs}}$ and $w_{\mathrm{ho}}$ disjoint. We approximate the posterior distribution of the topic-word parameter $\lambda_{kw}$ implied by the training data $S$, and then use that approximate posterior and the word set $w_{\mathrm{obs}}$ to estimate the document-topic parameter $\gamma_{dk}$ for the test document. Finally, the predictive log-likelihood of the held-out words, namely $\log p(w_{\mathrm{ho}} \mid w_{\mathrm{obs}}, S)$, is computed. The performance of the algorithm is measured by the average predictive log-likelihood per held-out word.

Figure 6 plots the predictive log-likelihood as a function of the running time. Again, we find that Splash converges faster than the single-thread SVI algorithm on all datasets. The running times of SVI and Splash to achieve the same predictive log-likelihood are summarized in Table 5. More precisely, Splash achieves 19x, 20x and 16x speedups over the single-thread SVI on the three datasets. The pairwise comparison between VI, SVI and Splash is quite interesting. For a small dataset like NIPS, the parallel batch algorithm (VI) is more efficient than the single-thread stochastic algorithm (SVI). For the medium-sized Enron dataset, the parallel batch algorithm is comparable to the single-thread stochastic algorithm. For the large dataset (NYTimes), the single-thread stochastic algorithm is faster than the parallel batch algorithm. This verifies the advantage of using stochastic algorithms for processing large datasets. On all three datasets, Splash is faster than the parallel batch algorithm. In particular, Splash is nearly 18x faster than the parallel variational inference algorithm on the NYTimes dataset.

Figure 6: The predictive log-likelihood as a function of the running time for the LDA model trained by Variational Inference (VI), Stochastic Variational Inference (SVI) and parallel SVI implemented by Splash. (a) NIPS dataset; (b) Enron dataset; (c) NYTimes dataset.

Dataset    Log-likelihood   VI Time (sec)   SVI Time (sec)   Splash Time (sec)
NIPS       -7.661           497             2,837            153
Enron      -8.086           1,821           6,674            335
NYTimes    -8.706           22,752          20,988           1,298

Table 5: Comparing the running time of parallel Variational Inference (VI), single-thread Stochastic Variational Inference (SVI) and its parallel version implemented by Splash for achieving the same predictive log-likelihood score. These scores are achieved by the Variational Inference algorithm taking 15 passes over the dataset.
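For reference, the global step of SVI blends the current topic-word parameter with an estimate computed from one mini-batch, using the learning rate $\rho_t = (t+1)^{-0.7}$. The sketch below shows only this step, under the assumption that the mini-batch estimate has already been rescaled to the full corpus size; the local updates of $\gamma_{dk}$ are omitted, and the names are ours rather than the Splash or Hoffman et al. [10] interfaces.

```python
def svi_global_step(lam, lam_hat, t, kappa=0.7):
    """One stochastic variational inference update of the K x V topic-word
    parameter lambda, with learning rate rho_t = (t + 1) ** (-kappa).

    lam     : current variational parameter (e.g., a NumPy array).
    lam_hat : intermediate estimate computed from a single mini-batch,
              assumed to be rescaled to the size of the full corpus.
    """
    rho = (t + 1.0) ** (-kappa)          # rho_t = (t + 1)^(-0.7) in our runs
    return (1.0 - rho) * lam + rho * lam_hat
```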

AUC Score   GD Time (sec)   BPR Time (sec)   Splash Time (sec)
0.91        28              355              9
0.94        619             1,331            112

Table 6: Comparing the running time of parallel gradient descent (GD), Bayesian personalized ranking (BPR) and its parallel version implemented by Splash for achieving the same AUC score.

H.4    Stochastic Collaborative Filtering

We use stochastic collaborative filtering to solve the Netflix movie recommendation problem. The Netflix dataset contains 100 million movie ratings made by 480,000 users on 17,000 movies. We split the dataset evenly into a training set and a testing set. In the language of personalized recommendation, we say that a user chooses a movie if the user rates the movie. The goal is to predict the set of movies that the user chooses in the testing set. To perform collaborative filtering, users and movies are associated with $d$-dimensional latent vectors, where $d = 10$. The latent vectors are learned through the Bayesian personalized ranking (BPR) algorithm described by equations (27)–(29). We choose the learning parameters $\lambda = \eta = 0.01$. Splash takes one pass over the dataset between two rounds of communication.

The performance is measured by the AUC (area under the ROC curve) metric. Given a user $u$, let $T_u$ be the set of movies that she rates in the testing set, and let $I$ be the set of all movies. The AUC score for this user is computed by
\[
\mathrm{AUC}_u = \frac{1}{|T_u|\,|I\setminus T_u|}\sum_{i\in T_u,\ j\in I\setminus T_u} I\big(\langle v_u, v_i\rangle > \langle v_u, v_j\rangle\big),
\]
where $I(\cdot)$ is the indicator function, which returns 1 if the inner statement is true and 0 otherwise. A higher AUC score indicates a better prediction accuracy. The AUC on the overall testing set is the average of the individual users' AUC scores weighted by the number of rated movies.
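The per-user AUC defined above can be computed directly from the learned latent vectors. A minimal sketch, where V is the movie-factor matrix, v_u is the user's latent vector, and test_items plays the role of $T_u$ (the variable names are ours):

```python
import numpy as np

def auc_user(v_u, V, test_items):
    """AUC_u: fraction of pairs (i in T_u, j not in T_u) for which
    <v_u, v_i> > <v_u, v_j>."""
    scores = V.dot(v_u)                       # <v_u, v_i> for every movie i
    pos = np.fromiter(test_items, dtype=int)
    neg = np.setdiff1d(np.arange(V.shape[0]), pos)
    wins = (scores[pos][:, None] > scores[neg][None, :]).sum()
    return wins / float(len(pos) * len(neg))
```

The overall testing AUC is then the average of auc_user over users, weighted by the number of rated movies.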

As an alternative baseline, we compare the stochastic algorithm with a batch alternating minimization method for minimizing the loss function
\[
L(v) := \sum_{u\in U}\sum_{i\in T_u}\Big( E\big[\log\big(1 + \exp(-\langle v_u,\, v_i - v_j\rangle)\big)\big] + \lambda\|v_u\|_2^2 + \lambda\|v_i\|_2^2 \Big),
\]
where $U$ is the set of all users and $T_u$ is the set of movies that the user chooses. The movie index $j$ for the vector $v_j$ is drawn uniformly at random from the movies that the user has not chosen. It is easy to verify that BPR is a stochastic algorithm for minimizing the function $L(v)$; thus, a batch method that minimizes the same function is a natural baseline. We parallelize the alternating minimization method and compare its convergence rate with BPR and Splash.

Figure 7 plots the AUC score as a function of the running time. Splash attains a high AUC score in one iteration. The single-thread BPR takes a much longer time to achieve the same accuracy. More specifically, Splash achieves an AUC score of around 0.91 in 9 seconds, whereas the single-thread BPR algorithm takes 355 seconds to achieve the same score. Splash also converges faster than the parallel gradient descent.

Figure 7: The AUC score as a function of the running time on the Netflix movie recommendation problem. The model is trained by parallel Gradient Descent, Bayesian Personalized Ranking (BPR) and its parallel version implemented by Splash.

Table 6 compares the running time of the three algorithms. To achieve AUC = 0.91, Splash is 3x faster than the parallel gradient descent and 40x faster than the single-thread BPR. To achieve the higher accuracy AUC = 0.94, Splash is 5.5x faster than the parallel gradient descent and 12x faster than the single-thread BPR. The experiment verifies the usefulness of Splash in learning collaborative filtering models, but the speedup over the single-thread algorithm is not as high as in solving convex optimization. We conjecture that this is partially due to the non-convex nature of the problem. In addition, the AUC score is a combinatorial metric that cannot be directly optimized; it is not the objective function that the algorithm explicitly maximizes. As a consequence, even if Splash converges very quickly to the optimal objective value, it does not necessarily converge as quickly to the best AUC score.

H.5    Comparing with other parallelization schemes

Finally, we compare Splash's parallelization strategy with two alternative schemes: accumulation and averaging. The accumulation scheme is defined in formula (3), which constructs the global update by summing up the local updates. The averaging scheme is defined in formula (4), which averages the local updates. In the context of SGD, the accumulation scheme resembles the HOGWILD! update scheme [21]. Figure 8(a) compares the convergence of the three strategies on the RCV1 dataset. We find that the accumulation scheme leads to divergence at the very beginning, then slowly converges in later iterations. The averaging scheme converges at a much slower rate than Splash. By comparing Figure 4(a) and Figure 8(a), we find that the convergence of the averaging scheme is very similar to that of the single-thread SGD, which suggests that averaging does not yield a convergence speedup.
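To make the comparison concrete, here is a minimal sketch of the two alternative combiners. Formulas (3) and (4) are not reproduced in this appendix, so this is our reading of "summing up local updates" versus "averaging local results", with illustrative names; Splash's own reweighting strategy is not shown.

```python
import numpy as np

def combine_accumulate(w_old, local_results):
    """Accumulation (cf. formula (3)): apply the sum of all local updates."""
    return w_old + sum(w_loc - w_old for w_loc in local_results)

def combine_average(local_results):
    """Averaging (cf. formula (4)): take the mean of the local results."""
    return np.mean(local_results, axis=0)
```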

For LDA trained by collapsed Gibbs sampling, parallelizing with the accumulation scheme is equivalent to the AD-LDA algorithm by Newman et al. [19]. Figure 8(b) shows that the accumulation scheme converges more slowly than Splash. Although Splash is a general-purpose system, it outperforms the algorithm specifically designed for LDA. In contrast, the averaging scheme performs poorly, converging to a sub-optimal solution.

For LDA trained by SVI, a straightforward application of the accumulation scheme is problematic, because the accumulation scheme may assign negative values to the parameter $\lambda_{kw}$. Recall that $\lambda_{kw}$ represents the probability that topic $k$ generates word $w$, so a negative probability is not allowed. We address the problem with a heuristic: setting $\lambda_{kw}$ to a small positive number whenever it is negative. Figure 8(c) compares Splash with the modified accumulation scheme and the averaging scheme. Although all strategies eventually converge, the accumulation scheme oscillates in early iterations and converges to a sub-optimal solution. The averaging scheme's convergence is slightly slower than that of Splash.

For stochastic collaborative filtering, Figure 8(d) compares Splash with the accumulation scheme and the averaging scheme on the Netflix dataset. The observation is similar to that for SGD: the accumulation scheme causes divergence at the very beginning, then slowly converges to a better solution. The averaging scheme converges at a much slower rate than that of Splash.

Figure 8: Comparing strategies for combining parallel updates: accumulation (3), averaging (4) or Splash's strategy. Figures (a)-(d) are respectively plots for stochastic gradient descent (RCV1), collapsed Gibbs sampling (Enron), stochastic variational inference (NIPS) and stochastic collaborative filtering (Netflix).
