Computing median values in a Cloud environment using GridBatch and MapReduce

Huan Liu, Dan Orban
Accenture Technology Labs
{huan.liu,[email protected]}

Abstract—Traditional enterprise software is built around a dedicated high-performance infrastructure, and it cannot be mapped to an infrastructure cloud directly without a significant performance loss. Although MapReduce [1] holds promise as a viable approach, it lacks building blocks that enable high-performance optimization, especially in a shared infrastructure. Following on our previous work [2], we introduce another building block called the Block Level Operator (BLO) and show how it can be applied to solve a real enterprise problem: finding medians in a large data set. We propose two efficient approaches to compute medians, one using MapReduce and the other using the BLO. We compare the two approaches with each other and with the traditional enterprise software stack, and show that the BLO-based approach gives an order of magnitude improvement.

I. INTRODUCTION

Because of its on-demand and pay-per-use nature, an infrastructure cloud, such as Amazon EC2, is ideal for applications with widely varying computation demand. Primary examples are large-scale data analysis jobs such as monthly reporting of large data warehouse applications, nightly reconciliation of bank transactions, or end-of-day access log analysis. Their computation profile is shown in Fig. 1. Because of business constraints, these jobs have to finish before a deadline. Enterprises typically provision dedicated server capacity up front; hence, the server capacity sits idle most of the time when the jobs are not running, wasting valuable computation resources.

Although these large-scale data analysis jobs could benefit greatly from an infrastructure cloud, it is not straightforward to port them over, because of the following constraints. First, because of the business model, a cloud is based on commodity hardware in order to lower the cost of computing. Commodity hardware not only has limited computing power per machine, it also has lower reliability. For example, Amazon only offers x86 servers, and the largest one is equivalent to a 4-core 2 GHz Opteron processor with 16 GB of memory. However, enterprise applications are typically architected such that they require high-end servers, and they rely on hardware to achieve both scaling and high reliability. For example, the Sun E25K server, a widely used platform in enterprises, has up to 72 processors and 1 TB of memory. Migrating these applications to the cloud infrastructure would require them to be re-architected. Second, the network bandwidth between two machines in the cloud is much lower than that in a traditional enterprise infrastructure.

Fig. 1. Computation profile of large-scale data analysis jobs. Large computation capacity is required for a short period of time. If dedicated computing power is provisioned, it will be idle most of the time.

The commodity cloud business model requires commodity networking hardware, so the network link is typically 1 Gbps or less, compared to the typical 10 Gbps links in the enterprise. The multi-tenancy nature of the cloud further limits the actual bandwidth to a fraction of the physical link bandwidth. For example, our own measurement of the network throughput on an Amazon EC2 server varies between 250 Mbps and 500 Mbps.

To overcome these challenges, we developed GridBatch [2], a system that makes it easy to parallelize large-scale data analysis jobs. GridBatch innovates in two aspects. First, it is a parallel programming model. If the user can express the parallelism at the data level, which is abundant in enterprise applications, the system can automatically parallelize the computation across thousands of computers. The associated implementation not only hides the details of parallel programming, but also relieves the programmers of much of the pain, such as implementing the synchronization and communication mechanisms, or debugging transient behaviors of distributed programs. Second, GridBatch is designed to run in a shared cloud environment. It is designed to run on thousands of commodity computers to harvest their collective computation power, instead of counting on the scaling up of a single high-end computer system. It replicates data over several computing nodes to guard against potential failures of commodity machines, and it can gracefully restart computation if any computing node fails. Further, it exploits and optimizes the local hard disk bandwidth in order to minimize network bandwidth consumption.

As we articulated in our earlier paper [2], our goal is not to help the programmers find parallelism in their applications. Instead, we assume that the programmers understand their applications well and are fully aware of the parallelization potential.


Further, we assume the programmers have thought through how to break down the application into smaller tasks and how to partition the data in order to achieve the highest performance. But, instead of asking the programmers to implement the plan in detail, we provide a library of commonly used "operators" (a primitive for data set manipulation) as the building blocks. All the complexity associated with parallel programming is hidden within the library, and the programmers only need to think about how to apply the operators in sequence to correctly implement the application.

In this paper, we introduce another operator called the Block Level Operator (BLO). As its name implies, it is designed to exploit parallelism at the level of a data chunk (an aggregate of many records) as opposed to the record level. We propose two approaches to finding medians in a large data set, one using MapReduce, the other using the BLO of GridBatch. Both approaches outperform the typical approach used in enterprises today by at least an order of magnitude. Further, the GridBatch approach takes one third of the time of the MapReduce approach for the small data set we evaluated, and the trend line shows that the gap widens as the data set grows.

II. PRIOR WORK

Enterprises typically use databases to implement the large-scale data analysis jobs that GridBatch targets. But using a database has several drawbacks.

1) First, databases present a high-level query language with the goal of hiding the execution details. Although easy to use, as noted by others [3], this high-level language forces users to express computation in ways that lead to poor performance. Sometimes the most efficient method would scan the data set only once, but when expressed in SQL, several passes are required. An automatic query processor cannot uncover the most efficient execution or achieve as high a performance as a programmer who understands the application well.

2) Second, current commercial database products run well in a traditional enterprise infrastructure where network bandwidth is plentiful, but they suffer badly in a cloud infrastructure because they are not able to exploit the local disk I/O bandwidth. Even though most database systems, including Oracle's commercial products, employ sophisticated caching mechanisms, many data accesses still traverse the network, consuming precious bandwidth.

3) Third, the query language used in most commercial database implementations lacks complex logic processing capability. Because of this, ETL (Extract, Transform, Load) tools have found widespread use alongside databases. ETL tools are not only used to extract and load data; in many cases they are also used to apply complex logic to transform the data. Data are shuffled from one staging table to another while ETL is applied to transform the data into a form that is usable by SQL queries. This process requires the large data set to be scanned and transferred more times than necessary, resulting in a significant increase in processing time.

Google's MapReduce [1] inspired our work. We inherit many of its designs in our system because they make parallel programming in the cloud easy. Specifically:

1) MapReduce and our system exploit data-level parallelism, which is not only easier to exploit, but is also abundant in large-scale enterprise applications. If the user can express the data parallelism, the job can be parallelized over thousands of computing nodes.

2) MapReduce and our system allow users to specify arbitrary logic in the form of user-defined functions. The system applies these functions in parallel in a pre-defined pattern, while placing no restrictions on the kind of logic the user can apply.

3) MapReduce and our system are built to tolerate failures in the commodity cloud infrastructure. Data are replicated several times across the computing nodes, so if one node fails, the data can still be recovered. Similarly, the unfinished computation on a failing node can be restarted gracefully on other computing nodes.

Our system shares many similarities with MapReduce. In fact, our system is built on top of Hadoop [4], an open source implementation of MapReduce. Although similar, we differ in two regards. First, in addition to MapReduce, our system has a family of operators, where each operator implements a separate parallel processing pattern. In this paper, we propose an additional operator called the "Block Level Operator" to take advantage of parallelism at the data chunk level. Second, we have extended the Distributed File System (DFS) to support an additional file type called "Fixed-num-of-Chunk" (FC) files. The user specifies a hash function, and the system partitions the file into distributed chunks. FC files, along with the family of operators, complement the existing capabilities in Hadoop and allow us to minimize unnecessary data shuffling, optimize data locality and reduce redundant steps. As a result, high performance can be achieved even in the challenging cloud infrastructure environment. This kind of control is especially important in enterprise data analysis applications, where most data are structured or semi-structured.

Microsoft Dryad [5], as well as the DryadLINQ [6], [7] system built on top of it, is another system similar to Google's MapReduce and ours. Compared to MapReduce, it is a much more flexible system. Users express their computation as a dependency graph, where the arcs are communications and the vertices are sequential programs. Their solution is very powerful because all parallel computations, including MapReduce, can be expressed as dependency graphs. Had it been publicly available, we could have built all our capabilities on top of Dryad instead of Hadoop. Another recent system, Map-Reduce-Merge [8], extends the MapReduce framework with the capability to merge two data sets, which is similar to the Join operator that we introduced in the first version of GridBatch [2].

Facebook Hive [9], Yahoo Pig [3] and Google Sawzall [10] focus on the same analytics applications as we do, but all take an approach similar to database systems. They present to the users a higher-level programming language, making it easy to write analytics applications.


However, because the users are shielded from the underlying system, they cannot optimize the performance to the fullest extent. In fact, Facebook Hive, Yahoo Pig and Google Sawzall are all built on top of Google's MapReduce, which lacks a few fundamental operators for analytics applications.

III. THE GRIDBATCH SYSTEM

In this section, we briefly describe the capabilities introduced in the first release of GridBatch [2] that are relevant for the following discussion; we refer interested readers to the original paper [2] for details. We then introduce the Block Level Operator (BLO), one of the contributions of this paper. The GridBatch system consists of two related software components: the Distributed File System (DFS) and the job scheduler.

A. Distributed File System

DFS is an extension of the Hadoop File System (HFS), an open source implementation of GFS [11], that supports a new type of file: Fixed-num-of-Chunk (FC) files. DFS is responsible for managing files and storing them across all nodes in the system. A large file is typically broken down into many smaller chunks, and each chunk may be stored on a separate node. Among all nodes in the system, one node serves as the name node, and all other nodes serve as data nodes.

The name node holds the name space for the file system. It maintains the mapping from a DFS file to the list of chunks, including which data node a chunk resides on and the location on that data node. It also responds to queries from DFS clients asking to create a new DFS file, allocates new chunks for existing files, and returns chunk locations when DFS clients ask to open an existing DFS file. A data node holds chunks of a large file. It responds to DFS client requests for reading from and writing to the chunks that it is responsible for. A DFS client first contacts the name node to obtain the list of chunk locations for a file, then it contacts the data nodes directly to read and write data.

There are two data types in GridBatch: table and indexed table (terms borrowed from database terminology). A table contains a set of records (rows) that are independent of each other. All records in a table follow the same schema, and each record may contain several fields (columns). An indexed table is similar to a table except that each record also has an associated index, where the index could simply be one of the fields or other data provided by the user.

DFS stores two types of files: Fixed-chunk-Size (FS) files and Fixed-num-of-Chunk (FC) files. FS files are the same as the files in HFS and GFS. They are broken down into chunks of 64MB each. When new records are written, they are appended to the end of the last chunk; when the last chunk reaches 64MB in size, a new chunk is allocated by the name node. For indexed tables, we introduced another type of file: FC files, which have a fixed number of chunks (denoted as C, defined by the user), where each chunk can have an arbitrarily large size. When a DFS client asks for a new file to be created, the name node allocates all C chunks at the same time and returns them all to the DFS client. Although the user can choose C to be any value, we recommend choosing C such that the expected chunk size (the expected file size divided by C) is small enough for efficient processing, e.g., less than 64MB.

Each FC file has an associated partition function, which defines how data should be partitioned across chunks. The DFS client submits the user-defined partition function (along with the parameter C) when it creates the file, and the function is then stored by the name node. When another DFS client asks to open the file later, the partition function is returned to it, along with the chunk locations. When a user writes a new data record to DFS, the DFS client calls the partition function to determine the chunk number(s), then it appends the record to the end of the corresponding chunk(s). Through the partition function, FC files, along with the Distribute operator [2], allow the user to specify how data should be grouped together for efficient local processing. Note that both FC files and the Distribute operator are aware of the number of partitions associated with a file, but they are unaware of the number of machines in the cluster. The file system takes care of the translation from partition to physical machine. By having a translation layer in the file system, we hide the possibly dynamic nature of the underlying infrastructure, where servers can come and go (for example, because of failures).
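To make the role of the partition function concrete, the following is a minimal sketch of what a user-defined partition function might look like. The names (PartitionFunc, CustomerRecord, chunkOf) and the plain-Java form are illustrative assumptions, not GridBatch's actual API; the only behavior taken from the text is that the function maps a record to one of the C chunks, here by hashing an index field so that related records land in the same chunk.

// Hypothetical sketch of an FC-file partition function; not the actual GridBatch API.
public class PartitionExample {

    // A record of the fact table; the fields are illustrative.
    static class CustomerRecord {
        final long customerId;
        final double balance;
        CustomerRecord(long customerId, double balance) {
            this.customerId = customerId;
            this.balance = balance;
        }
    }

    // User-defined partition function: maps a record to a chunk number in [0, C).
    interface PartitionFunc {
        int chunkOf(CustomerRecord r, int numChunks);
    }

    public static void main(String[] args) {
        final int C = 16; // fixed number of chunks chosen at file-creation time

        // Hash the customer id so that all records of one customer land in the same chunk.
        PartitionFunc byCustomer = (r, numChunks) ->
                (Long.hashCode(r.customerId) & 0x7fffffff) % numChunks;

        CustomerRecord r = new CustomerRecord(123456789L, 5400.25);
        System.out.println("record goes to chunk " + byCustomer.chunkOf(r, C));
    }
}

In this sketch, all records with the same customer id hash to the same chunk, which is exactly the kind of grouping that later lets a chunk be processed locally without network traffic.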

B. The job scheduling system

The job scheduling system is the same as that of MapReduce. It includes a master node and many slave nodes. A slave node is responsible for running a task assigned by the master node. The master node is responsible for breaking down a job into many smaller tasks, as expressed in the user program. It distributes the tasks across all slave nodes in the system, and it monitors the tasks to make sure all of them complete successfully. In general, a slave node is also a data node. Thus, when the master schedules a task, it can schedule the task on the node which holds the chunk of data to be processed. By processing data on the local node, we save precious network bandwidth.

GridBatch extends Google's MapReduce system with many operators, each of which implements a particular pattern of parallel processing. MapReduce can be considered as two separate operators, Map and Reduce, applied in a fixed sequence. The Map operator is applied to all records in a file independently of each other; hence, it can be easily parallelized. The operator produces a set of key-value pairs to be used by the Reduce operator. The Reduce operator takes all values associated with a particular key and applies a user-defined reduce function. Since all values associated with a particular key have to be moved to a single location where the Reduce operator is applied, the Reduce operator is inherently sequential.
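As a minimal illustration of this two-operator view, the sketch below applies a user-defined map function to every record independently and then a user-defined reduce function to all values sharing a key. It is plain Java with hypothetical names rather than Hadoop's classes, and, for brevity, the grouping of emitted pairs by key (the shuffle) is folded into the map phase.

import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.function.Function;

// Plain-Java sketch of the Map/Reduce pattern; names are illustrative, not Hadoop's API.
public class MapReducePattern {

    // "Map" phase: apply the user-defined map function to every record
    // and group the emitted key-value pairs by key (the shuffle).
    static <R, K, V> Map<K, List<V>> mapPhase(List<R> records,
                                              Function<R, Map.Entry<K, V>> mapFunc) {
        Map<K, List<V>> grouped = new HashMap<>();
        for (R record : records) {
            Map.Entry<K, V> kv = mapFunc.apply(record);
            grouped.computeIfAbsent(kv.getKey(), k -> new ArrayList<>()).add(kv.getValue());
        }
        return grouped;
    }

    // "Reduce" phase: apply the user-defined reduce function to all values of one key.
    static <K, V, O> Map<K, O> reducePhase(Map<K, List<V>> grouped,
                                           Function<List<V>, O> reduceFunc) {
        Map<K, O> out = new HashMap<>();
        for (Map.Entry<K, List<V>> e : grouped.entrySet()) {
            out.put(e.getKey(), reduceFunc.apply(e.getValue()));
        }
        return out;
    }

    public static void main(String[] args) {
        // Count records per age bracket, a flavor of the statistics discussed later.
        List<String> brackets = List.of("20-30", "30-40", "20-30", "20-30");
        Map<String, List<Integer>> grouped =
                mapPhase(brackets, b -> Map.entry(b, 1));                        // emit (bracket, 1)
        Map<String, Integer> counts =
                reducePhase(grouped, vs -> vs.stream().mapToInt(i -> i).sum()); // sum per key
        System.out.println(counts); // e.g., {30-40=1, 20-30=3}
    }
}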


The first release of GridBatch introduced several operators including Map, Distribute, Join, Recurse, Cartesian, and Neighbor. The Map operator is designed to exploit parallelism at the data record level: the system applies a user-defined Map function to all data records in parallel. We also introduced the Neighbor operator, which exploits parallelism in sequential analysis where only neighboring records are involved. The user provides a user-defined function which takes the current record and its immediate k neighbors as inputs, and the system applies this function to all records in parallel. For details of the other operators, we refer interested readers to our earlier paper [2].

C. Block level operator

In this section, we introduce the BLO. In addition to exploiting parallelism at the record level (the Map operator) and at the neighbor level (the Neighbor operator), the BLO allows us to exploit parallelism at the chunk level. As an example, we will show how it can be used to efficiently compute medians over a large data set.

The BLO applies a user-defined function to one chunk at a time, where a chunk is a set of records that are stored logically and physically in the same location in the cluster. The user invokes the BLO as follows:

  BLO(Table X, Func bloFunc)

where X is the input table and bloFunc is the custom function provided by the user. bloFunc takes an iterator of records as an argument. When iterating through the iterator, the records are returned in the same order as they were written to the chunk. A sample bloFunc pseudo-code for counting the number of records in a chunk is as follows:

  bloFunc(Iterator records)
    int count = 0
    for each record x in records
      count++
    EmitResult(Table Z, count)

This user-defined function counts the number of records in the input iterator and, at the end, adds the count value to a new Table Z. At the end of this BLO, each chunk will have produced a count value. To get the overall count, a MapReduce or a Recurse operator has to be applied to sum up all the values in Table Z.
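The pseudo-code above can be turned into the following self-contained sketch. The Record type, the iterator-based interface and the way results are collected are assumptions made for illustration, not GridBatch's actual classes; the sketch only mirrors the logic described in the text: each chunk is reduced to one count, and a second pass sums the per-chunk counts.

import java.util.Arrays;
import java.util.Iterator;
import java.util.List;

// Minimal sketch of the counting BLO described in the text.
// Record and the emit step are hypothetical stand-ins, not GridBatch's API.
public class BloCountSketch {

    static class Record { /* fields omitted; only the record count matters here */ }

    // User-defined BLO function: counts the records of one chunk.
    static long bloFunc(Iterator<Record> records) {
        long count = 0;
        while (records.hasNext()) {
            records.next();
            count++;
        }
        return count; // in GridBatch this value would be emitted into a result table Z
    }

    public static void main(String[] args) {
        // Two "chunks" of records, standing in for the chunks of an FC file.
        List<Record> chunk1 = Arrays.asList(new Record(), new Record(), new Record());
        List<Record> chunk2 = Arrays.asList(new Record(), new Record());

        // The BLO applies bloFunc to every chunk independently (in parallel in the real system).
        long c1 = bloFunc(chunk1.iterator());
        long c2 = bloFunc(chunk2.iterator());

        // A follow-up MapReduce or Recurse pass would sum the per-chunk counts.
        System.out.println("total records = " + (c1 + c2));
    }
}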

Fig. 2. Comparison between the Map, Neighbor and BLO operators. (a) Map, (b) Neighbor, (c) BLO.

Fig. 2 shows a comparison between the Map, Neighbor and BLO operators. The Map operator is designed to exploit parallelism among independent records. The user-defined Map function is applied to all records at the same time.

The Neighbor operator is designed to exploit parallelism among sub-sequences when analyzing a sequence of records. The user-defined Neighbor function is applied to all sub-sequences at the same time. The BLO implements yet another pattern of parallel processing: the user-defined BLO function is applied to all chunks at the same time, but the processing within a chunk may be sequential.

The BLO works in conjunction with FC files, where all data that have to be processed sequentially are already arranged in the same chunk. A chunk is guaranteed to be stored physically on the same node, and hence it can be processed efficiently on the local node without consuming network bandwidth. There are a couple of ways to shuffle data into the correct chunks. When data are written into DFS, the user can choose to write to an FC file with a user-defined partition function, which ensures that the correct data are loaded into the correct chunks. Alternatively, if the data are already stored in an FS file, the user can invoke the Distribute operator; again, the user supplies a partition function which ensures that the data are loaded correctly.

The BLO can be considered as the Reduce portion of MapReduce, except that it is a stand-alone operator and it involves no sorting and no grouping by key. It is implemented as a child class of the Task class, the base class for both the MapTask and ReduceTask classes in the Hadoop implementation. We inherit from Task instead of ReduceTask because the BLO does not need the data shuffling and sorting operations in the ReduceTask class. Similar to the Join operator we introduced in [2], the functionality of the BLO could be implemented with MapReduce. However, as we will see in our application of computing medians, using MapReduce would be very inefficient, since it would have to invoke the identity Mapper, shuffle all data around, and sort the data unnecessarily. This is especially bad when multiple passes of MapReduce are involved, where the work done in one pass has to be repeated in the next pass because the MapReduce framework has no mechanism to save the intermediate data.

IV. FINDING MEDIANS

To evaluate the applicability and performance of the BLO, we consider a real enterprise application: a data warehouse application for a large financial services firm. The company has tens of millions of customers, and they are interested in collecting and reporting high-level statistics, such as average and median, about their customers' account balances. They want to collect these statistics across many different dimensions of their customer base. For example, across age groups: what is the balance for 20-30 year olds, 30-40 year olds, etc.; or across industries: what is the balance for customers in the retail or high-tech industries. They are also interested in combinations of many dimensions, such as age groups within different industries, or job tenure length within different geographies.

We use the term "segmentation" to refer to a particular combination of the dimensions.


For example, computing medians across age groups is one segmentation, and computing medians across both age group and industry is another segmentation. We use the term "bracket" to refer to a range within a segmentation. For example, users that are 20-30 years old and are in the retail industry form one bracket. We need to compute one median for each bracket, and many medians for each segmentation, where each median corresponds to one bracket within the segmentation. We denote the number of dimensions by D and the number of segmentations by S. In the worst case, S could be as large as D!.

The input to the problem is a large fact table with tens of millions of rows. Each row holds all relevant information specific to a customer, including the customer's account balance, birthday, industry, geography, job tenure length, education, etc.

Computing the average is relatively easy because one can simply sum up the total and divide it by the count, where both the total and the count are easy to compute in parallel with MapReduce. However, computing the median is quite awkward with MapReduce, because it requires sequential processing. A straightforward implementation would first sort all data and then find the middle point. Both steps are sequential in nature, and hence they take a long time to complete for a large data set. The problem gets worse in our case because there are a large number of median computations. One of the contributions of this paper is the design of two efficient approaches to compute medians, one using MapReduce and the other using the BLO. In the following, we first describe the traditional approach to computing medians and point out its deficiencies, and then we describe our approaches using MapReduce and the BLO.

A. Traditional enterprise approach

The most common solution in enterprises today for large-scale data warehousing applications is to use a database. Once the fact table is loaded into the database, one can simply write a SQL query to compute the 50th percentile value, or call the median function directly if the SQL platform provides one. When computing medians for a segmentation, it is more efficient to write one SQL query that computes the medians for all brackets within the segmentation. This can be achieved by a combination of the group by and case clauses. An example for the age group segmentation is as follows:

  select age_group, median(balance) from (select balance, age_group=(case 20 < age
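The example query is cut off at this point in the text. The sketch below shows one plausible form of the full pattern described in the prose: a case expression that buckets age into groups, with a median aggregate grouped by bucket. The table name fact_table and the exact bucket boundaries are assumptions, and platforms without a median() aggregate typically expose the same computation as PERCENTILE_CONT(0.5).

  -- Hedged reconstruction of the pattern, not the authors' original query.
  -- On platforms without median(), PERCENTILE_CONT(0.5) WITHIN GROUP (ORDER BY balance)
  -- plays the same role.
  select age_group, median(balance) as median_balance
  from (select balance,
               (case when age >= 20 and age < 30 then '20-30'
                     when age >= 30 and age < 40 then '30-40'
                     when age >= 40 and age < 50 then '40-50'
                     else 'other'
                end) as age_group
        from fact_table) t
  group by age_group;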
