MapScale: A Cloud Environment for Scientific Computing
Chris Bunch, Brian Drawert, Matthew Norman
Computer Science Department
University of California, Santa Barbara
Abstract
Parallel programming languages have sought out many different means by which large numbers of cores can be utilized toward a common goal. Languages such as MPI provide a low-level, message-passing model, while others such as UPC provide a high-level, shared-memory model. Both of these languages, and many others, require the user to be aware of the number of nodes available in the system at a given time, as well as other details of the system. MapReduce takes an approach orthogonal to the debate between languages such as MPI and UPC. It offers an effective method for solving embarrassingly parallel problems and uses neither message passing nor shared memory; instead, it requires the programmer to supply two serial programs, named Map and Reduce. Communication is minimal, since these functions can only send or receive data at the beginning or end of their execution. This work explores which scientific computing problems are amenable to representation in MapReduce and which are not. We implement a number of non-trivial algorithms, measure their performance, and report the Map and Reduce tasks used. We also examine the computational bottlenecks we encountered, which provide many avenues for discussion.
1. Introduction
Scientific computing and distributed computing have gone hand-in-hand for many decades. Many programming languages, such as MPI, UPC, and Cilk [2], have been created with the intent of giving programmers the tools they need to run large programs efficiently over clusters of computers. However, with the introduction of cloud computing, an unprecedented amount of computing power is now available to the average programmer. Cloud infrastructures such as Amazon EC2 and Eucalyptus [6] give users an easy way to request arbitrary numbers of machines and log into them via commonly used protocols, while cloud platforms such as Google App Engine offer programmers a framework that can be heavily optimized and scaled on Google's closed-source infrastructure.
Previous work has introduced AppScale [3], an open-source Platform-as-a-Service (PaaS) that allows users to deploy their Google App Engine applications to an open hosting environment that can be scaled in a more transparent manner. This platform runs over both Amazon EC2 and Eucalyptus, allowing users to run their applications across as many compute nodes as they can afford. However, the applications commonly deployed on Google App Engine to this point have been relatively simple web applications. Since Google's MapReduce [4] programming paradigm is already well known to parallelize computation with great success, we investigate the feasibility of running scientific computation with MapReduce over the AppScale platform.
2. Related Work
MapReduce is known to achieve notable speedups on certain problems, but it is not yet clear how many problems benefit from the framework. The original paper [4] bills it as an excellent framework for embarrassingly parallel problems, and further work has aimed at widening the domain of problems that can be expressed in MapReduce's semantics. It has already been shown [7] that matrix multiplication, the n-body problem, and linear regression are all highly amenable to MapReduce, and that MapReduce is not limited to running on distributed systems. Others in the parallel programming community have shown that MapReduce can be implemented on FPGAs as well as GPUs [8]. Both lines of work are novel, but we differentiate ourselves by the underlying cloud environment we run on, the algorithms we implement, and the programming languages used to do so.
3. Problem Domain
In this work we examine the viability of implementing the NAS Parallel Benchmarks in MapReduce. Before presenting our results, we give a brief summary of the MapReduce programming paradigm as well as the five algorithms we have worked on for this project.
3.1 MapReduce
MapReduce is a programming framework whose ideas are not new. As the original paper [4] states, it takes the names of its primitives from the map and reduce functions found in Lisp and other functional languages. Distributed programming languages such as MPI, UPC, and Cilk far precede MapReduce and offer similar capabilities. The notable difference between those systems and MapReduce is that the latter enforces a much more restricted environment and claims that, by doing so, it can offer greater benefits than other programming frameworks. Specifically, MapReduce only allows the programmer to write a Map and a Reduce function, both of which must be stateless. This places MapReduce in stark contrast with the previously designed languages, which give the programmer much finer-grained control over the distributed environment. MapReduce does not allow the user to communicate between nodes, or even allow the functions to know how many nodes are performing the computation. To this point we have stated how MapReduce differs from other distributed programming solutions, but not how MapReduce works. The framework allows users to write two functions, Map and Reduce, with the following signatures:
Map: (k1, v1) → list(k2, v2)
Reduce: (k2, list(v2)) → list(v2)
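To make this contract concrete, the following is a minimal sketch of what a Map and Reduce pair looks like under Hadoop Streaming, which later sections rely on. The tab-separated key/value convention on stdin/stdout is Hadoop Streaming's default; the word-count logic itself is purely illustrative and is not one of the benchmarks studied in this paper.

#!/usr/bin/env python
# Minimal Hadoop Streaming sketch of the Map/Reduce contract:
#   mapper():  (k1, v1)        -> list(k2, v2)
#   reducer(): (k2, list(v2))  -> list(v2)
# Hadoop Streaming passes records on stdin and expects tab-separated
# key/value pairs on stdout.
import sys
from itertools import groupby


def mapper(stream):
    # (k1, v1) is (line offset, line text); the offset is ignored here.
    for line in stream:
        for word in line.split():
            print("%s\t%d" % (word, 1))


def reducer(stream):
    # Hadoop sorts mapper output by key before this phase runs, so all
    # values for a given key arrive contiguously.
    pairs = (line.rstrip("\n").split("\t", 1) for line in stream)
    for key, group in groupby(pairs, key=lambda kv: kv[0]):
        total = sum(int(v) for _, v in group)
        print("%s\t%d" % (key, total))


if __name__ == "__main__":
    # Invoke with the argument "reduce" to act as the reducer script.
    if sys.argv[1:] == ["reduce"]:
        reducer(sys.stdin)
    else:
        mapper(sys.stdin)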
The underlying MapReduce runtime is responsible for the large majority of the work in the system. It schedules mappers and reducers, restarts them if they fail (typically because the user's code contains an error), and load-balances work evenly across processors. The runtime starts a MapReduce job by partitioning the input files across Map functions, which can be done in parallel since the input resides on a distributed, replicated file system (GFS [5] for Google, HDFS for the open-source Hadoop). Once each mapper produces a list of key-value pairs, the runtime sorts the keys and spawns reducers with the necessary keys and lists of values. Finally, once the reducers have completed, their output lists are merged and written back to the distributed file system. By placing most of the complexity of the system in the runtime, the programmer is relieved from having to implement it on their own. Conversely, this restricted environment only allows the programmer to use MapReduce on programs that require no I/O within the Map and Reduce functions and are very compute-intensive. This work will show which scientific computing problems fit into this paradigm and which do not (and if not, why not).
Although the original work is based on a closed-source implementation, we use two open-source implementations that follow the same semantics described here. Hadoop MapReduce, one such implementation, requires that the Map and Reduce functions be written in Java, whereas Hadoop Streaming can use Map and Reduce functions constructed in any language. Our implementations use both products, depending on the programming language chosen for the given task. We note inline which algorithms are implemented in which languages to show the variety of languages that can intermingle in the system.
3.2 NAS Parallel Benchmarks
The NAS Parallel Benchmarks [1] are a compilation of algorithms chosen to allow users to benchmark the speed of their system or cluster. The algorithms share several properties: they are highly parallelizable, the solutions they produce are easily verifiable, and they are sufficiently generic that they can be used to solve a wide variety of problems. We focus our attention on a subset of the first specification of the NAS Parallel Benchmarks, consisting of the following algorithms. For each, we also provide a short description.
• Embarrassingly Parallel: Use the Marsaglia polar method to generate independent Gaussian random variables.
• Integer Sort: Use the linear-time bucket sorting algorithm to sort a large number of integers.
• Conjugate Gradient: An iterative method for solving systems of equations.
• Fast Fourier Transform: An algorithm commonly used to solve a three-dimensional partial differential equation.
• Block Tridiagonal: A method to solve a linear system of equations where the input matrix has block tridiagonal form.
4. Evaluation
The NAS Parallel Benchmarks provide descriptions of the various algorithms involved as well as deterministic methods to verify that their output is correct. We provide results for varying problem sizes for three of the five algorithms (both Conjugate Gradient and Block Tridiagonal took exceedingly long to run). All experiments were run over four Xen virtual machines, each with two 2.83 GHz Xeon processors and 2 GB of memory.
4.1 Embarrassingly Parallel
This algorithm was designed to benchmark a cluster's peak floating-point performance and, as such, uses an extremely minimal amount of communication and is largely compute-bound. The pseudorandom number generator provided has a useful property: given the initial seed, it can generate any number in the sequence the generator produces via a closed-form computation. This property allows us to easily parallelize the algorithm and have it generate the same result every time.
We provide an implementation in Ruby using Hadoop Streaming and attempted to also implement it in Java using Hadoop MapReduce. Unfortunately, while the Java version provides variables with sufficient precision, mathematical operations between very large and very small numbers cause them to lose too much precision. Java's arbitrary-precision class, BigDecimal, does not alleviate the problem, as this algorithm requires the natural logarithm function, which that class does not provide. We attempted to implement the natural logarithm using only the exponential function (via a variant of the bisection method), but this also fell apart, since the exponentiation provided only takes integer powers. Therefore, we provide only the Ruby implementation (which also uses arbitrary-precision variables).
Figure 1. Run times for the Embarrassingly Parallel implementation
To fit the algorithm into the MapReduce programming paradigm, our implementation is as follows (programmed in Ruby):
• Pre-Computation: We are required to generate 2^n random numbers for this algorithm. Since the numbers can be generated in parallel, we assign each Mapper and Reducer a large number of random numbers at a time. Therefore, we break up the range 0 ... 2^n into 2^(n/2) buckets, each of size 2^(n/2). We output the bounds of each range into an input file, which will be partitioned and read by our Map function.
• Map: Each mapper is given a chunk of the input file, in which each line contains a tuple corresponding to the beginning and end of the range of values it needs to produce. It produces these values and uses the Marsaglia polar method to map them to pairs of Gaussian random variates (x and y). It finds the absolute values of the pair and calculates l, the floor of the greater of the two values. It then emits three values: l, x, and y.
• Reduce: Each reducer reads in the three values from the Map function, and since each reducer gets all the values with the same key, it has all the values for a given l. Since it has all the x and y pairs as well, it counts the number of x values that were greater than their corresponding y values and vice versa. Once this is done, it writes the sum of all the x values to a file and does the same for the y values.
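To make the Map step above concrete, the following is a minimal Python sketch of the Marsaglia polar method written as a Hadoop Streaming mapper. The paper's actual implementation is in Ruby and uses the benchmark's seeded, closed-form generator; here Python's random module and a space-separated input format stand in for those details, so treat this as illustrative only.

#!/usr/bin/env python
# Illustrative Hadoop Streaming mapper for the EP benchmark's Map step.
# Each input line holds the start and end of a range of sample indices;
# Python's random module replaces the benchmark's seeded generator, so
# this sketch shows the shape of the computation, not the exact kernel.
import math
import random
import sys

for line in sys.stdin:
    fields = line.split()
    if len(fields) < 2:
        continue
    start, end = int(fields[0]), int(fields[1])
    for _ in range(start, end, 2):  # each accepted trial consumes a pair of draws
        # Marsaglia polar method: draw points in the unit square until one
        # falls inside the unit circle, then transform to two Gaussians.
        while True:
            u = 2.0 * random.random() - 1.0
            v = 2.0 * random.random() - 1.0
            s = u * u + v * v
            if 0.0 < s < 1.0:
                break
        factor = math.sqrt(-2.0 * math.log(s) / s)
        x, y = u * factor, v * factor
        l = int(math.floor(max(abs(x), abs(y))))
        # Emit the bucket l as the key so the reducer sees every pair
        # sharing it, with x and y as the value.
        print("%d\t%.17g\t%.17g" % (l, x, y))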
The performance of our implementation of the Embarrassingly Parallel (EP) benchmark is presented in Figure 1. Here the problem size refers to the number of distinct queries we make to the provided random number generator. We then take pairs of these numbers and, if they pass a certain test, output them as results (therefore the number of results is always less than or equal to the problem size). We can see in that figure that the running time increases exponentially as the problem size doubles (once the problem size is above 2^11). This is the expected relationship, and we wish to examine the speedup on greater numbers of nodes and greater problem sizes in future work.
4.2 Conjugate Gradient
The Conjugate Gradient algorithm (CG) is an iterative algorithm that is commonly used to solve systems of equations. To make our CG implementation as general-purpose as possible, we have implemented it using dense matrices, as opposed to the more common case of using it only on sparse matrices. The algorithm relies on three simple primitives: DAXPY (multiplies a vector x by a constant a and adds the result to a vector y), DDOT (dot product), and MatVec (multiplying a matrix by a vector, producing a vector). Some minor computation is done outside of these three methods, but the majority of the time is spent in them. Therefore, we have constructed versions of each of these three methods in the MapReduce programming paradigm, in the hope that speeding them up will greatly reduce the time spent in the algorithm. Our implementations for each of the necessary methods are as follows, beginning with DAXPY (programmed in Python):
• Pre-Computation: We construct an input file which contains n lines (with the dimensions of the matrices X and Y being n × n). Each line contains the values i, A, X_i, Y_i.
• Map: For each line of input read in, the Map function emits i as the key and A * X_i + Y_i as the value.
• Reduce: Since all the computation was performed in the Map phase, the identity reducer is used here. In contrast to other algorithms, where no reducer is used to save time, we use a reducer here to ensure that the output data is sorted by key (thus the result vector ends up in the correct format).
DDOT can be constructed in MapReduce via the following methods (programmed in Python):
• Pre-Computation: We construct an input file which contains n lines (with the dimensions of the vectors X and Y being n × 1). Each line contains the values i, X_i, Y_i.
• Map: For each line of input read in, the Map function emits 1 as the key and X_i × Y_i as the value. This ensures that a single reducer will receive all of the values, so that it can sum up the data and produce the final result.
• Reduce: Every Map task produces a value with the same key (that is, 1), so only one Reduce task is spawned and is given all the values produced. It simply sums up the values it is given and outputs the final summation as the dot product.
Finally, we perform MatVec by the use of a general-purpose algorithm written to perform matrix multiplication. We thus perform matrix multiplication as follows (implemented in both Python and Perl, although this algorithm uses the Python implementation):
• Map: In order to produce A = BC for matrices A, B, and C, each mapper is passed a row i of the matrix B and a column j of the matrix C. The mapper then emits the key (i, j) and the value Σ(B_i × C_j).
• Reduce: No computation is needed here, so we use the identity reducer. Optimizations to this algorithm eliminate the Reduce phase altogether, since we do not need the output keys to be sorted.
Unfortunately, this iterative algorithm takes an unreasonable amount of time to run, so we have omitted performance numbers for it. There are, however, many avenues we seek to explore to make it perform better. All three of these algorithms take in a large amount of data and produce a relatively small amount of output (DAXPY and DDOT each produce a single value, while MatVec only produces a single column). Specifically, in DAXPY, each Map task takes in two columns and only produces a single value. Further exploration will investigate whether the performance of the underlying system can be improved by passing in more data, in order to amortize the cost of starting up a new process with the given input data and passing the output data to the sorter. DDOT's Map phase suffers from this problem even more: it takes in two values and produces one value (which takes much less time than moving the data into and out of that program). Furthermore, since there is only one Reduce task, it is a bottleneck in the system. We wish to see the performance gain from using more than one node to perform these summations in parallel (and then using a single node to add up a number of values equal to the number of reducers). Finally, the Map phase in MatVec takes in two vectors of length n and only produces a single value. Future work will investigate whether passing in an entire matrix and a column, producing a full column of the result, adds enough computation to produce a meaningful speedup.
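To ground the descriptions above, here is a hedged Python sketch of the DAXPY mapper and the DDOT mapper/reducer as Hadoop Streaming scripts. The comma-separated line layout mirrors the pre-computation formats described above, but the exact file format and script structure of the paper's implementation are not shown in the text, so consider this illustrative rather than the authors' code.

#!/usr/bin/env python
# Illustrative Hadoop Streaming scripts for the DAXPY and DDOT primitives.
# Input lines follow the pre-computation formats described above:
#   DAXPY: "i,A,X_i,Y_i"      DDOT: "i,X_i,Y_i"
# The comma delimiter is an assumption made for this sketch.
import sys


def daxpy_map(stream):
    # Emits (i, A*X_i + Y_i); the identity reducer then sorts by key so
    # the result vector comes out in order.
    for line in stream:
        i, a, x_i, y_i = line.strip().split(",")
        print("%s\t%.17g" % (i, float(a) * float(x_i) + float(y_i)))


def ddot_map(stream):
    # Every record is emitted under the single key "1" so that one
    # reducer sees every partial product.
    for line in stream:
        _, x_i, y_i = line.strip().split(",")
        print("1\t%.17g" % (float(x_i) * float(y_i)))


def ddot_reduce(stream):
    # The lone reducer sums the partial products into the dot product.
    total = 0.0
    for line in stream:
        _, value = line.strip().split("\t")
        total += float(value)
    print(total)


if __name__ == "__main__":
    mode = sys.argv[1] if len(sys.argv) > 1 else "daxpy-map"
    {"daxpy-map": daxpy_map,
     "ddot-map": ddot_map,
     "ddot-reduce": ddot_reduce}[mode](sys.stdin)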
Figure 2. Run times for the Fast Fourier Transform implementation
4.3 Fast Fourier Transform
The Fast Fourier Transform is an algorithm that can be used to solve partial differential equations (although it can be used to solve PDEs in any dimension, our implementation is in one dimension). This algorithm requires two compute-heavy functions, DDOT and VecMul. Since DDOT has already been covered in our discussion of the Conjugate Gradient algorithm, we do not discuss it again here. Furthermore, since VecMul is simply the AX component of DAXPY, its implementation is quite similar to DAXPY (programmed in Python):
• Pre-Computation: We construct an input file which contains n lines (with the dimensions of the vector X being n × 1). Note that here we force the input to be a vector, whereas in DAXPY it could have been a matrix. Each line contains the values i, A, X_i.
• Map: For each line of input read in, the Map function emits i as the key and A * X_i as the value.
• Reduce: Since all the computation was performed in the Map phase, the identity reducer is used here. In contrast to other algorithms, where no reducer is used to save time, we use a reducer here to ensure that the output data is sorted by key (thus the result vector ends up in the correct format).
Figure 2 shows the running time of our Fast Fourier Transform implementation over varying problem sizes. Here, the problem size refers to the size of the input vector provided. It is important to note that the running time is measured in thousands of seconds, so even the smallest problem sizes take unreasonably long to run (a problem of size one hundred takes nearly an hour). Although this is an O(n log n) algorithm, we observe scaling closer to O(n). Unfortunately, the performance is completely unreasonable, as the amount of computation performed in the underlying primitives does not amortize the startup and communication costs. Future work will investigate whether this result holds over larger problem sizes and how to relieve these performance concerns.
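The overhead pattern described here, many short MapReduce jobs whose startup and data movement dominate the useful work, comes from launching a separate job for each primitive invocation. The sketch below shows roughly what such a driver loop looks like when the primitives are run through Hadoop Streaming; the jar location, HDFS directory names, and script names are placeholders invented for illustration, not the paper's actual driver.

#!/usr/bin/env python
# Rough sketch of a driver that launches one Hadoop Streaming job per
# primitive invocation. Every call pays job-submission, scheduling, and
# HDFS I/O costs, which is the overhead discussed above. The jar path,
# HDFS directories, and script names below are illustrative placeholders.
import subprocess

STREAMING_JAR = "/usr/lib/hadoop/contrib/streaming/hadoop-streaming.jar"  # assumed location


def run_streaming_job(input_dir, output_dir, mapper, reducer=None):
    # Build a standard "hadoop jar hadoop-streaming.jar ..." invocation.
    cmd = ["hadoop", "jar", STREAMING_JAR,
           "-input", input_dir,
           "-output", output_dir,
           "-mapper", mapper]
    if reducer is not None:
        cmd += ["-reducer", reducer]
    else:
        cmd += ["-numReduceTasks", "0"]  # map-only job
    subprocess.check_call(cmd)


def cg_iteration(step):
    # One CG iteration issues several tiny jobs; each one re-reads its
    # input from HDFS and writes its output back, so the fixed per-job
    # cost is paid many times per iteration.
    run_streaming_job("cg/in/matvec_%d" % step, "cg/out/matvec_%d" % step,
                      "matvec_map.py")
    run_streaming_job("cg/in/ddot_%d" % step, "cg/out/ddot_%d" % step,
                      "ddot_map.py", "ddot_reduce.py")
    run_streaming_job("cg/in/daxpy_%d" % step, "cg/out/daxpy_%d" % step,
                      "daxpy_map.py", "identity_reduce.py")


if __name__ == "__main__":
    for step in range(10):  # ten iterations as an example
        cg_iteration(step)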
Figure 3. Run times for the two Integer Sort implementations over varying problem sizes
4.4 Integer Sort
For this benchmark, we are required to generate a large number of random values in the range [0, 10) and sort them via the bucket sort algorithm. We use MapReduce in the following fashion (programmed in Ruby and Java):
• Pre-Computation: This work is identical to how we generated the pre-computation data for the Embarrassingly Parallel benchmark (EP).
• Map: Each mapper is given a chunk of the input file, in which each line contains a tuple corresponding to the beginning and end of the range of values it needs to produce. It generates the random values for the entire range it is responsible for and then emits two values: l and val, where l is the bucket that val falls into (that is, l = ⌊val⌋).
• Reduce: We use the identity function for this reduction (that is, we output the given key and value only). This is sufficient because the sorting phase has already sorted our data by the bucket it falls under. Since this is not technically a bucket sort (Hadoop's sort implementation is quicksort), we also provide code to perform the bucket sort as an alternative reducer. However, it is not used in these numbers, since the data is already sorted by this point.
We provide two implementations of this benchmark: one in Ruby using Hadoop Streaming and one in Java using Hadoop MapReduce. As Figure 3 shows, however, the running time of the Java version increases exponentially with problem size. This is due to the inherent slowness of using Java's BigDecimal library to obtain arbitrary-precision variables and perform arithmetic on them. Arbitrary precision is not required for this experiment, and we implemented a version of the code that uses Java's highly optimized double primitive instead, but it keeps sufficient precision only in the uppermost bits (severely impacting the correctness of our implementation). Ruby also provides arbitrary-precision variables, but in contrast to Java, it starts with a small variable and promotes it to a larger one as needed. This saves both space and computation time.
Both arbitrary-precision implementations (Java and Ruby) experience an exponential slowdown for growing problem sizes. Since each problem size we investigated was twice the previous size, we expected to see the running time double in the best case. Surprisingly, many cases existed where this behavior was not seen. For example, both implementations perform roughly the same at problem sizes at and below the initial problem size, 2^8. We were also interested in running this code over much larger problem sizes but found they simply took too long. We seek to return to this problem at a later time to see if it becomes viable with a larger number of nodes.
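As with EP, the Map step above can be sketched as a short Hadoop Streaming script. The version below is a Python illustration of the bucketing logic; the paper's actual implementations are in Ruby and Java, and the uniform random source and input format here are stand-ins rather than the benchmark's seeded generator.

#!/usr/bin/env python
# Illustrative Hadoop Streaming mapper for the Integer Sort benchmark.
# Each input line gives the start and end of a range of values to
# generate; the mapper emits floor(val) as the key so Hadoop's sort
# groups values by bucket before the identity reducer runs.
import math
import random
import sys

for line in sys.stdin:
    fields = line.split()
    if len(fields) < 2:
        continue
    start, end = int(fields[0]), int(fields[1])
    for _ in range(start, end):
        val = random.uniform(0.0, 10.0)  # values in [0, 10)
        bucket = int(math.floor(val))    # l = floor(val)
        print("%d\t%.17g" % (bucket, val))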
4.5 Block Tridiagonal
This algorithm solves a large, sparse linear system of equations in block tridiagonal form (hence the name). At its core, it is an implementation of the Thomas tridiagonal matrix algorithm operating on the small, dense sub-matrices that make up the non-zero blocks of the input matrix. The algorithm relies on four primitive functions: matrix subtraction (an extension of DAXPY), matrix inversion via Gauss-Jordan elimination, matrix multiplication, and the solution of small dense linear systems via Gaussian elimination. We have already covered how DAXPY and matrix multiplication are implemented in MapReduce in the Conjugate Gradient section, so we omit them here. Gaussian elimination and Gauss-Jordan elimination are very similar functions, so similar that they share their general structure as well as all of their Map and Reduce functions. These algorithms are broken up into three phases: pivoting, forward elimination, and backward substitution. One round of pivoting and forward elimination is performed for each row of the input matrix (the current column j is passed in and referred to as such, to differentiate it from the "input row" i, the row given to the Map task). Pivoting is performed as follows (programmed in Java):
• Pre-Computation: None, but the calling program will perform a pivot MapReduce run for each column in the input matrix (thus an n × n matrix requires n MapReduce runs to pivot it).
• Map: Each mapper reads in the current column and its input row and finds the value at position (i, j) (referred to as the "score"). It then divides each value of its input row by this value and emits 1 as the key, and the score, the input row number, and the modified row as the value.
• Reduce: Since all the Map tasks produce the same key, there is only one Reduce task. It simply searches through all the values and finds the one with the largest score. It outputs the corresponding row number and modified row.
Forward elimination is implemented in a similar manner (programmed in Java):
• Pre-Computation: Like before, the calling program will perform a forward elimination MapReduce run for each column in the input matrix (thus an n × n matrix requires n^2 MapReduce runs to forward eliminate it). The input data contains n lines, with each line consisting of the current column and the input row.
• Map: Each Map task reads in the current column and the input row and finds the value corresponding to the intersection of the two (which we will call k). It then emits the row number as the key and the new vector (i − k × j) as the value.
• Reduce: No computation is needed here, so we use the identity reducer. Optimizations to this algorithm eliminate the Reduce phase altogether, since we do not need the output keys to be sorted.
Finally, backward substitution is performed serially. After pivoting, the last row is independent of all other rows, the second-to-last row depends only on the last row, and we solve for the unknown by substitution. This is repeated on each row until we have solved for all the unknown values.
Matrix multiplication and matrix subtraction each require only a single round of MapReduce, whereas both Gaussian elimination and matrix inversion require two rounds of MapReduce per column of the matrix being computed. The block tridiagonal algorithm requires a matrix inversion and a Gaussian elimination for each row of sub-matrices. This ends up running an exceedingly large number of MapReduce jobs (each over a relatively small input) and causes the algorithm to be unreasonably slow.
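To make the pivoting round concrete, here is a minimal Python sketch of the Map and Reduce steps described above, written in Hadoop Streaming style for consistency with the earlier sketches. The paper's actual implementation is in Java, and the line format assumed here (column index, row index, then the row's entries, space-separated) is our own convention, not necessarily the authors'.

#!/usr/bin/env python
# Illustrative Hadoop Streaming scripts for one pivoting round.
# Each input line is assumed to hold the current column index j, the
# input row index i, and the row's entries, all space-separated; this
# layout is an assumption made for the sketch.
import sys


def pivot_map(stream):
    for line in stream:
        fields = line.split()
        if len(fields) < 3:
            continue
        j = int(fields[0])                    # current column
        i = int(fields[1])                    # input row number
        row = [float(x) for x in fields[2:]]
        score = row[j]                        # value at position (i, j)
        scaled = [x / score for x in row]     # divide the row by the score
        # Single key "1" so one reducer can compare every candidate pivot.
        print("1\t%s\t%d\t%s" % (repr(score), i,
                                 " ".join(repr(x) for x in scaled)))


def pivot_reduce(stream):
    # Pick the candidate with the largest score and output its row
    # number along with the modified row.
    best = None
    for line in stream:
        _, score, i, row = line.rstrip("\n").split("\t")
        if best is None or float(score) > float(best[0]):
            best = (score, i, row)
    if best is not None:
        print("%s\t%s" % (best[1], best[2]))


if __name__ == "__main__":
    if sys.argv[1:] == ["reduce"]:
        pivot_reduce(sys.stdin)
    else:
        pivot_map(sys.stdin)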
Figure 4. Run times for the Gaussian Elimination implementation
Since Gaussian elimination requires the largest number of MapReduce jobs, we present its performance in Figure 4. We can see that the running time increases linearly with respect to problem size, but that the amount of time taken is simply far too long, even for very small problem sizes. The large number of MapReduce jobs generates a large amount of communication while, in our example, producing a relatively small amount of computation. Future work will investigate how we can correct this imbalance and provide a feasible method by which matrix inversion can be performed.
5. Conclusions
This work contributes MapScale, an open-source implementation of a number of scientific computing algorithms that run seamlessly over the MapReduce framework. While future work will aim at integrating it with AppScale so that users can run these algorithms over large numbers of nodes, the current work shows a number of interesting results. It is already known that MapReduce works effectively on embarrassingly parallel algorithms, which is clearly reflected in the performance of our Embarrassingly Parallel and Integer Sort benchmarks. More interestingly, we see that MapReduce is not well suited to the iterative algorithms explored here. This effect is more pronounced in the algorithms we chose, where each iteration runs a number of MapReduce tasks. Furthermore, since the computation in these tasks is extremely light, the cost of running MapReduce in these scenarios is not amortized effectively. That is not to say there is no hope for these problems. For each non-optimal scenario in this paper, we have identified the problem and suggested solutions. For all the problems involved, we also wish to investigate the impact of adding additional nodes to the system (that is, whether we can always achieve a linear speedup as we add more nodes). Regardless, this work contributes a great deal to the current state of MapReduce on non-trivial applications and provides solid ground for future work to build on.
References
[1] D. Bailey, E. Barszcz, J. Barton, D. Browning, R. Carter, L. Dagum, R. Fatoohi, S. Fineberg, P. Frederickson, T. Lasinski, R. Schreiber, H. Simon, V. Venkatakrishnan, and S. Weeratunga. The NAS Parallel Benchmarks. RNR Technical Report RNR-94-007, 1994.
[2] Robert D. Blumofe, Christopher F. Joerg, Bradley C. Kuszmaul, Charles E. Leiserson, Keith H. Randall, and Yuli Zhou. Cilk: An Efficient Multithreaded Runtime System. Journal of Parallel and Distributed Computing, 37(1):55-69, 1996.
[3] Navraj Chohan, Chris Bunch, Sydney Pang, Chandra Krintz, Nagy Mostafa, Sunil Soman, and Rich Wolski. AppScale Design and Implementation. UCSB Technical Report 2009-02, 2009.
[4] Jeffrey Dean and Sanjay Ghemawat. MapReduce: Simplified Data Processing on Large Clusters. In OSDI '04: Sixth Symposium on Operating System Design and Implementation, 2004.
[5] Sanjay Ghemawat, Howard Gobioff, and Shun-Tak Leung. The Google File System. In 19th ACM Symposium on Operating Systems Principles, 2003.
[6] Daniel Nurmi, Rich Wolski, Chris Grzegorczyk, Graziano Obertelli, Sunil Soman, Lamia Youseff, and Dmitrii Zagorodnov. The Eucalyptus Open-source Cloud-computing System. In Proceedings of Cloud Computing and Its Applications, 2008.
[7] Colby Ranger, Ramanan Raghuraman, Arun Penmetsa, Gary Bradski, and Christos Kozyrakis. Evaluating MapReduce for Multi-core and Multiprocessor Systems. In HPCA '07, 2007.
[8] Jackson H.C. Yeung, C.C. Tsang, K.H. Tsoi, Bill S.H. Kwan, Chris C.C. Cheung, Anthony P.C. Chan, and Phillip H.W. Leong. Map-reduce as a Programming Model for Custom Computing Machines. In 16th International Symposium on Field-Programmable Custom Computing Machines, 2008.