2012 Second International Conference on Cloud and Green Computing

Beyond Simple Integration of RDBMS and MapReduce – Paving the Way toward a Unified System for Big Data Analytics: Vision and Progress

Xiongpai QIN 1,2,3, Huiju WANG 1,2,3, Furong LI 1,2,3, Baoyao ZHOU 4, Yu CAO 4, Cuiping LI 1,2,3, Hong CHEN 1,2,3, Xuan ZHOU 1,2,3, Xiaoyong DU 1,2,3, Shan WANG 1,2,3

1 Ministry of Education Key Lab of Data Engineering and Knowledge Engineering (RUC), Beijing, 100872, P.R. China
2 Sa Shi-Xuan Big Data Management and Analytics Research Center (Sino-Australia), Beijing, 100872, P.R. China
3 School of Information, Renmin University of China, Beijing, 100872, P.R. China
4 EMC Labs China, 100084, P.R. China
[email protected], [email protected], [email protected], [email protected], [email protected], [email protected], [email protected], [email protected], [email protected], [email protected]

Abstract—MapReduce has shown vigorous vitality and penetrated both academia and industry in recent years. MapReduce is not only an ETL tool; it can do much more. The technique has been applied to SQL aggregation, OLAP, data mining, machine learning, information retrieval, multimedia data processing, scientific data processing, etc. Basically, MapReduce is a general-purpose parallel computing framework for large dataset processing. A big data analytics ecosystem built around MapReduce is emerging alongside the traditional one built around RDBMS. The objectives of RDBMS and MapReduce, as well as the ecosystems built around them, overlap considerably: in some sense they do the same thing, and MapReduce can accomplish more tasks, such as graph processing, which RDBMS cannot handle well. RDBMS enjoys high performance for relational data processing, which MapReduce still needs to catch up with. The authors envision that the two techniques are fusing into a unified system for big data analytics. With the ongoing endeavor to build up the system, much of the groundwork has been laid, while some critical issues remain unresolved; we try to identify some of them. Two of our works, together with experimental results, are presented: one applies a hierarchical encoding to star schema data in Hadoop for high-performance OLAP processing; the other leverages the natural three copies of HDFS blocks to exploit different data layouts to speed up queries in an OLAP workload, with a cost model used to route user queries to different data layouts.

Keywords- RDBMS; MapReduce; Big Data Analytics; Vision; Unified System; OLAP

I. CHALLENGES OF BIG DATA ANALYTICS AND THE RISE OF MAPREDUCE

With the declining price of storage devices, the explosion of internet and sensor network applications, and the demands of scientific research, people are for the first time accumulating datasets of a volume unseen before: the age of big data is coming. Major big data sources include electronic business, social media and social networks and other internet applications, sensor networks, the internet of things, and scientific experiments. Big data is often characterized by four Vs [1], namely Volume, Velocity, Variety, and Variability. The four Vs articulate a range of challenges, among which volume is the dominant one, followed by data variety. To extract information from a large volume of varied data, the data processing system should employ a scale-out architecture: when the data volume grows even larger, we can simply add more nodes to achieve the expected performance. RDBMS is not ready for the coming big data era. First, it cannot scale to a large cluster with thousands of nodes, due to the enforcement of ACID constraints and other factors. Second, it cannot handle semi-structured and unstructured data well; putting such data in an RDBMS does not yield high enough performance (please refer to a comparison of an RDBMS and a graph database for graph data processing [2]). Google published its parallel processing technique for unstructured data, MapReduce, in 2004 [3]. Like a stone thrown into calm water, the technique broke the peace: it not only attracted great interest from the parallel computing and database research communities, but also significantly changed the competitive landscape of the data management industry. MapReduce swept through the parallel computing community and aroused a tide of research from 2006 to 2008; the database community then caught up and pushed the work forward with another tide of research from 2009 to 2012. The research has touched almost every aspect of MapReduce [4], including (1) storage layout, indexing, and data variety; (2) extensions of MapReduce for stream processing, iterative processing, and leveraging large memories; (3) join optimization and the parallelization of complex analytical algorithms; (4) scheduling strategies for multi-core CPUs, GPUs, heterogeneous environments, and the cloud; (5) energy saving, privacy, and security guarantees; and (6) easy-to-use interfaces for SQL, data mining, machine learning, multimedia data processing, and scientific data processing. A number of startups have sprung up and built their techniques and businesses around MapReduce. The tides of research in academia and the surge of startups in industry forebode a new ecosystem surrounding MapReduce. We see that RDBMS and MapReduce, as well as the ecosystems built around them, basically serve the same purpose, namely data analytics. In fact, MapReduce can do much more in terms of analytic complexity and data variety. We envision a unified system for big data analytics in the future, which fuses RDBMS and MapReduce into a single platform. Two of our works, together with experimental results, are presented in this paper: one applies a hierarchical encoding to star schema data in Hadoop for high-performance OLAP processing; the other leverages the natural three copies of HDFS blocks to exploit different data layouts to speed up queries in an OLAP workload, with a cost model developed to route user queries to different data layouts.


II. VIRTUES AND LIMITATIONS OF RDBMS AND MAPREDUCE

From the angle of data processing, RDBMS and MapReduce can be compared with each other. Since its introduction in the 1970s, the relational model has been extensively studied. The relational database system (RDBMS) is the strongest force in the database market and generates large revenue every year. The data stored in an RDBMS is highly structured and can be reached through different access paths; various index techniques have been developed for fast data access. Users access the data with a standard declarative language, SQL. They do not need to care about what the data layout is, which indexes can be used, or which algorithm is more suitable for joining: an optimizer does the job, searching a large space for an efficient execution plan for every user query. RDBMS has several advantages, including the separation of physical storage and logical schema, high data consistency, high reliability, and high performance. In the age of big data, however, RDBMS encounters difficulties when handling large volumes of data. First, it does not scale well; when a large volume of data needs to be processed, scaling out is the choice. RDBMS has not been deployed onto a cluster of more than 1,000 nodes, while MapReduce has been deployed on a cluster of nearly 4,000 nodes at Yahoo. Some people may argue that we do not need a cluster that large, and that a dozen powerful nodes can do the same work well when every node is equipped with powerful CPUs, large memory and high-speed storage. These people forget that when IBM's engineers designed the first PC, they thought a memory of 640KB would never be used up, and now look at what has happened. Second, RDBMS cannot handle some semi-structured and unstructured data well. Taking graph data as an example, storing graph data in relational tables is not straightforward, and the joins will slow down the whole system when users need to traverse the graph or perform so-called "relationship analysis" on it. MapReduce was created by Google mainly to process large volumes of unstructured data. Due to its nice properties of high scalability and high fault tolerance, the technique has received attention from both industry and academia. MapReduce is a general execution engine that is ignorant of storage layouts and data schemas. The runtime system automatically parallelizes computations across a large cluster of machines, handles failures, and manages disk and network efficiency. The user only needs to provide a map function and a reduce function. The map function is applied to all input rows of the dataset and produces an intermediate output, which is aggregated by the reduce function later to produce the final result. The map and reduce interface is simple, but at the same time it is not that simple. Besides simple SQL aggregation, researchers have migrated complex algorithms onto the MapReduce platform, including OLAP, data mining, machine learning, information retrieval, multimedia data processing, and scientific data processing; many more algorithms, including graph processing algorithms [5], can be run on the MapReduce platform. MapReduce can now handle not only unstructured data but also structured data efficiently [6]. The virtues and limitations of the two are listed in Table I.




TABLE I. COMPARISON OF RDBMS AND MAPREDUCE

Comparison Item | RDBMS | MapReduce
Schema Support | Strict schema | No schema (1)
Index | Various indexes available | No index (2)
Programming Model / Flexibility | SQL, a declarative language, enough for data aggregation and easy to use; more complex analytics need a UDF programmed in some other language | Map and reduce functions; a rather low-level interface, not easy to use, but very flexible
Optimization | A complex optimizer tries to find optimal execution plans for queries | Needs to be improved (3)
Execution Strategy | The whole query is restarted when a failure happens (4) | A job is broken into tasks; fine-grained restart of tasks when failures happen
Fault Tolerance | Low | High
Scalability | Low; mainly scaling up, with limited scaling out onto hundreds of nodes | High; can scale out onto a large cluster of more than 1,000 nodes
Heterogeneous Environments | Limited support | Full support

Notes: (1) After applying some structure to HDFS blocks, MapReduce can handle structured data [6] [7]. (2) Some research has tried to accelerate data access in MapReduce by adding some form of index. (3) Some work has been done in recent years on job scheduling optimization and on the parallelization and optimization of complex algorithms for MapReduce. (4) When the intermediate data is checkpointed regularly, the query does not need to restart from the very beginning when a failure happens.
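To make the map and reduce interface summarized in Table I concrete, the following minimal word-count sketch mimics the two phases in a single Java process. It deliberately avoids the Hadoop API, so the class and method names are illustrative only.

```java
import java.util.*;

// A minimal, single-process illustration of the MapReduce programming model:
// a map function emits (key, value) pairs, the framework groups them by key,
// and a reduce function aggregates the values of each key.
public class WordCountSketch {

    // Map: split one input line into (word, 1) pairs.
    static List<Map.Entry<String, Integer>> map(String line) {
        List<Map.Entry<String, Integer>> out = new ArrayList<>();
        for (String w : line.toLowerCase().split("\\W+")) {
            if (!w.isEmpty()) out.add(new AbstractMap.SimpleEntry<>(w, 1));
        }
        return out;
    }

    // Reduce: sum all counts emitted for one word.
    static int reduce(String word, List<Integer> counts) {
        int sum = 0;
        for (int c : counts) sum += c;
        return sum;
    }

    public static void main(String[] args) {
        List<String> input = Arrays.asList("big data analytics", "big data platform");

        // Shuffle phase: group intermediate values by key.
        Map<String, List<Integer>> grouped = new TreeMap<>();
        for (String line : input) {
            for (Map.Entry<String, Integer> kv : map(line)) {
                grouped.computeIfAbsent(kv.getKey(), k -> new ArrayList<>()).add(kv.getValue());
            }
        }

        // Reduce phase.
        for (Map.Entry<String, List<Integer>> e : grouped.entrySet()) {
            System.out.println(e.getKey() + "\t" + reduce(e.getKey(), e.getValue()));
        }
    }
}
```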

III. INTEGRATION OF RDBMS AND MAPREDUCE: CURRENT SOLUTIONS AND SHORTCOMINGS

From the above table, we can see that in some sense the two techniques are complementary to each other. Researchers have been working to integrate the strengths of the two while avoiding their weaknesses. There are different types of integration solutions between RDBMS and MapReduce, including loosely coupled and tightly integrated ones [8].

A. Loose Coupling of RDBMS and MapReduce
There are two types of loose coupling solutions. First, MapReduce can be used as an ETL tool [9] that preprocesses raw material to extract the data, transform it, and finally load it into an RDBMS. Some aggregation can be done during the ETL phase, but more complex analytics are performed in the RDBMS. On the other hand, in a transactional production system, the RDBMS stores all business data. As time elapses, more and more data accumulates in the RDBMS, hindering it from running efficiently. Keeping the production system lean is the key to continuous, efficient operation.


The aging data should be cleared from the production system; that data can be stored in MapReduce for simple archiving or for later analysis.

B. Tighter Coupling of RDBMS and MapReduce
We identify two types of tight coupling of RDBMS and MapReduce: integrating RDBMS into MapReduce, and integrating MapReduce into RDBMS. Aster Data [10] and Greenplum [11] are two startups founded in 2005; their relationship resembles that between KFC and McDonald's, because both combine the PostgreSQL database, a shared-nothing architecture and the MapReduce technique into a product that provides a database for big data analytics, and they are competitors. Their databases employ an integrate-MapReduce-into-RDBMS strategy to provide MapReduce computing capability. These systems are not MapReduce systems in the conventional sense, but provide MapReduce-style computation inside the RDBMS; put another way, they are relational databases that have MapReduce functionality. The common feature of Aster Data and Greenplum is that the core engine can not only run SQL queries but also act as a MapReduce job execution engine, and data can flow from SQL operators to MapReduce tasks and vice versa for complex processing. MapReduce-style parallelism of statistical and data mining functions achieves much higher performance than traditional UDFs (user defined functions) in a relational database [10]. From a technical viewpoint, however, the integration approaches adopted by Aster Data and Greenplum do not fully incorporate the most important properties of the MapReduce framework, namely high scalability and high fault tolerance. We do not expect to see a database system based on Aster Data or Greenplum deployed onto a cluster of more than 4,000 nodes in the near future (please refer to Section II for why we need to scale to a cluster of 4,000 nodes or more). The objective is difficult to achieve without fundamental modification to the underlying architecture of Aster Data and Greenplum.
HadoopDB [12] [13] (now commercialized as Hadapt) is the first attempt to adopt an integrate-RDBMS-into-MapReduce approach. HadoopDB combines Hadoop and PostgreSQL to create a hybrid database that can process large amounts of structured and unstructured data. Hadoop, on the upper level, takes care of job scheduling, task coordination and parallelization, as well as communication; individual PostgreSQL instances form the storage layer on the bottom level. When HadoopDB receives a SQL query, the query is translated into a MapReduce job and then broken into sub-queries for individual database instances. When a sub-query reaches a database instance, the optimizer considers factors such as available indexes, shared I/O, buffer management, data layout and compression, and tries its best to find an optimal execution plan for the sub-query. HadoopDB tries to retain the high scalability and fault tolerance of MapReduce while leveraging the high performance of RDBMS, but it has several disadvantages:

(1) HadoopDB has a longer data loading time than ordinary MapReduce systems. (2) Because the fault tolerance of MapReduce is achieved through three-copy replication of HDFS blocks, HadoopDB loses the fault tolerance guarantee of the MapReduce framework when the data is stored in PostgreSQL. (3) When a join cannot be satisfied within one node, HadoopDB must join tables across the network, which is beyond the scope of an individual RDBMS instance; the performance of such a join relies on the scheduling and execution strategy of the MapReduce framework, which needs more optimization at the upper level [14] and has not been completed.
We appreciate the RCFile work [7] by Facebook. The work borrows ideas from RDBMS research: it applies an elaborate structure to HDFS blocks to achieve higher data access performance while fully retaining the nice scalability and fault tolerance properties of MapReduce. The work could be classified into the integrate-RDBMS-into-MapReduce category, but in a more fused manner than HadoopDB. In RCFile, the data (a structured table) is first partitioned horizontally into blocks; within each block, the data is further broken into individual columns, and every column is stored contiguously in the block. The idea is essentially the PAX layout that has been extensively studied in the RDBMS community. Because the data is stored in blocks using a columnar layout, compression techniques can be applied to reduce space consumption.
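To illustrate the PAX-style layout that RCFile applies inside an HDFS block, the following hedged Java sketch partitions a small relation into row groups and stores each column's values contiguously within a group. The class names and the fixed row-group size are ours for illustration, not RCFile's actual on-disk format.

```java
import java.util.*;

// A simplified sketch of a PAX-style block layout (the idea behind RCFile):
// rows are first partitioned horizontally into row groups, and inside each
// group the values of each column are laid out contiguously, which enables
// per-column compression and lets a scan skip unneeded columns in a group.
public class PaxLayoutSketch {

    // Horizontal partitioning: split the rows into fixed-size row groups.
    static List<List<String[]>> toRowGroups(List<String[]> rows, int groupSize) {
        List<List<String[]>> groups = new ArrayList<>();
        for (int i = 0; i < rows.size(); i += groupSize) {
            groups.add(rows.subList(i, Math.min(i + groupSize, rows.size())));
        }
        return groups;
    }

    // Within one row group, store each column's values contiguously.
    static List<List<String>> toColumnRuns(List<String[]> group, int numColumns) {
        List<List<String>> columns = new ArrayList<>();
        for (int c = 0; c < numColumns; c++) {
            List<String> run = new ArrayList<>();
            for (String[] row : group) run.add(row[c]);
            columns.add(run);
        }
        return columns;
    }

    public static void main(String[] args) {
        List<String[]> rows = Arrays.asList(
            new String[]{"1", "Asia", "China"},
            new String[]{"2", "Asia", "Korea"},
            new String[]{"3", "America", "USA"},
            new String[]{"4", "America", "Canada"});

        int groupId = 0;
        for (List<String[]> group : toRowGroups(rows, 2)) {
            System.out.println("row group " + (groupId++) + ": " + toColumnRuns(group, 3));
        }
    }
}
```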

IV. WHY TWO TECHNIQUES/ECOSYSTEMS FOR ONE THING? THE VISION OF A UNIFIED SYSTEM FOR BIG DATA ANALYTICS
A big data analytics ecosystem built around MapReduce is emerging alongside the traditional one built around RDBMS. The objectives of RDBMS and MapReduce, as well as the ecosystems built around them, overlap considerably; in some sense they do the same thing, and MapReduce can accomplish more tasks, such as graph processing (social network analysis), which RDBMS cannot handle well. Why two techniques/ecosystems for one thing? We envision that the two techniques are fusing into a unified system (see Figure 1-a) for big data analytics. Figure 1-b depicts the whole system as a layered diagram. With the ongoing endeavor toward finally building up the unified system (of course, most engineering work should be done by vendors), much of the groundwork has been laid in recent years [4], while some critical issues remain unresolved and deserve serious investigation.
A. The Intelligent Storage Layer
The Intelligent Storage Layer is responsible for how data is organized, i.e. storage management, data layout, deduplication and compression, indexing, and metadata management. We emphasize the blending of structured data and unstructured data in one storage layer. One reason is that variety is one of the characteristics of big data. Take graph data as an example: graphs are nowadays used to tackle more and more problems, and a graph view of data may become a standard view, side by side with traditional relational data and other data types. Graphs grow larger and larger, and users need to mine them.


A scalable storage layer for various data types, including graph data, is needed.

Putting these pieces into a holistic optimizer for job scheduling is a challenging task. Many factors should be taken into consideration: the optimizer should work across various decision points and recognize interactions among different levels, including resource provisioning (node heterogeneity, node failures, network topology), job-level factors, task-level factors, and storage-level factors (storage layout, available indexes, …). The optimizer should be both data-aware and workload-aware. The next generation of the MapReduce architecture adds some enhancements in scalability, performance and availability. The new framework decouples the MapReduce computing paradigm from the resource management architecture and enables new application types to plug into the Hadoop platform (through an Application Manager), including stream processing, graph processing, bulk synchronous processing, and the message passing interface (MPI) [15]. This is good news for the envisioned unified system. For efficient execution of complex algorithms (including data mining, machine learning, graph processing, etc.), the algorithms should be parallelized as far as possible so that they can exploit the capability of a large cluster. Some algorithms, such as k-means, run in an iterative manner, which demands system-level support from the big data analytics platform. Some issues remain unresolved (some are partially resolved), such as: (1) How to leverage the large memory of the nodes in a cluster to facilitate MapReduce computation, rather than always saving the intermediate data onto disk, while guaranteeing a high degree of fault tolerance. (2) How to enable sharing of scans, computation, sorting, shuffling, and output generation among data-intensive jobs/tasks on a large cluster. (3) To handle continuously arriving data in a timely manner, incremental algorithms should be developed for specific analytical purposes; since incremental algorithms sometimes cannot find hidden patterns spanning a long time period, how to combine offline analytical results with on-the-fly analytical results deserves research effort. The optimization on the scheduling and execution layer is not only about achieving higher performance, but also about reducing resource consumption. Energy saving is an issue that requires more study. How to provide privacy and security guarantees in data-intensive computing environments while incurring limited overhead is also a pressing issue, especially when running big data analytics on the cloud.

Figure 1. The Unified System for Big Data Analytics ((a) the unified system; (b) its layered structure)

Making modifications to the RDBMS code base here and there is awkward; it is time to rewrite the code from the bottom up. The new unified system can be built around the next generation of HDFS/MapReduce [15]; RCFile is a great piece of work on the right track, but much work remains to be done. Various types of data can be stored in HDFS by applying different structures to HDFS blocks; the data types include relational data, XML, graphs, RDF, high-dimensional arrays, trajectory data and GIS data, as well as unstructured data such as documents, audio, video, email, etc. One of the biggest challenges in processing unstructured data is giving structure to unstructured data. Features extracted from unstructured data can be stored in the same file system using a structured format, so the links between the features and the unstructured data can be easily set up and maintained. Some important issues deserve further study, for example: which data layout is optimal for different types of data, which index techniques can be used to improve data access performance, how to bridge unstructured and structured data for joint analysis [16], and how to leverage HDFS's three copies of each block to improve storage layer performance.

C. The Interface Layer
The system should provide different interfaces for users with different skill levels, including junior users, senior users and data scientists, and programmers. Junior users may prefer a declarative language like SQL [17] for simple analytics, mainly summations and aggregations. Even operations on graph data could use some form of declarative language; we have not seen such a language yet and hope that it will appear soon. Junior users are not necessarily satisfied with simple analytics alone; complex algorithms, including machine learning, that can be expressed in a declarative language would greatly help them understand the data at hand (such as SystemML from IBM, http://www.almaden.ibm.com/cs/projects/systemml/).

B. The Scheduling and Execution Layer
On the Scheduling and Execution Layer, the toughest issue to tackle is optimization. There have been some research works on optimization techniques for MapReduce, including optimized job/task scheduling techniques for multi-core CPUs, GPUs, heterogeneous environments and cloud platforms, and optimized execution techniques for join algorithms and for data mining and machine learning algorithms [4]. Although some work has been done on job scheduling, the work is dispersed.


To answer a query, a master node rewrites the query and dispatches it onto the data nodes; the data nodes perform local aggregation in parallel, which incurs no data transmission cost. Partial aggregations are collected and merged later on the master node to generate the global aggregation; some residual joining is needed to look up data in the dimension tables. Since the result of an OLAP query is usually very small, the final join will not be a performance bottleneck.

As for senior users and data scientists, they may need deeper analytical algorithms to extract insights from the data; they are often familiar with models but unfamiliar with programming, so an easy-to-use, high-level procedural language like R [18] is suitable for them. Many algorithms in R should be fully parallelized to exploit the computation power of a large cluster. When SQL and R cannot meet the analytic requirements, skilled programmers come into sight: they use SDKs like Mahout [19] to create software containing complex algorithms that meet special needs, and the languages they use can be C/C++, Java, or Python.

2) The Hierarchical Encoding Scheme
We use an imaginary star schema here, which includes a fact table (revenue) and two dimension tables, a date dimension and a customer dimension. The hierarchical encoding scheme is illustrated using the customer dimension table. There are four levels of hierarchy in the customer dimension, namely Region->Country->City->Customer. We encode each member on a hierarchical level to a bit string using its local domain, as depicted in the following figure (simplified for illustration purposes). Under the "Asia" node on the Region level there are only two members, so the local domain of the "Asia" node is [China, Korea]. Since there are only two members, the values can be encoded using only 1 bit: "China" is encoded as 0, and "Korea" as 1. Generally speaking, when there are M members in a local domain, we can encode the members using only ⌈log2 M⌉ bits. In the figure we purposely limit the number of members of each node on every level to 2, so 1 bit is enough to encode the members.
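The following Java sketch illustrates the local-domain encoding just described: each hierarchy level is given ⌈log2 M⌉ bits, where M is the largest local-domain cardinality at that level, each member is encoded as its index within its parent's local domain, and the per-level codes are concatenated. The helper names are ours; the "Beijing" code reproduces the simplified example, while the second city is an assumed member added only for contrast.

```java
// A hedged sketch of the hierarchical encoding: every hierarchy level is given
// a fixed width of ceil(log2(largest local-domain size at that level)) bits,
// each member is encoded as its index within its parent's local domain, and the
// per-level codes are concatenated into one dimension hierarchical code.
public class HierarchyEncodingSketch {

    // Bits needed for a local domain with m members (at least 1 bit).
    static int bitsFor(int m) {
        return Math.max(1, 32 - Integer.numberOfLeadingZeros(Math.max(1, m - 1)));
    }

    // Concatenate per-level member indexes using the given per-level bit widths.
    static long encode(int[] memberIndexPerLevel, int[] bitsPerLevel) {
        long code = 0;
        for (int level = 0; level < memberIndexPerLevel.length; level++) {
            code = (code << bitsPerLevel[level]) | memberIndexPerLevel[level];
        }
        return code;
    }

    public static void main(String[] args) {
        // Region -> Country -> City, as in the simplified example: every local
        // domain has at most 2 members, so 1 bit per level is enough.
        int[] bitsPerLevel = { bitsFor(2), bitsFor(2), bitsFor(2) };   // {1, 1, 1}

        // Asia is member 0 of the Region domain, China is member 0 under Asia,
        // Beijing is member 0 under China -> hierarchical code "000".
        long beijing = encode(new int[]{0, 0, 0}, bitsPerLevel);
        // An assumed city, member 0 under Korea (member 1 under Asia) -> "010".
        long koreaCity0 = encode(new int[]{0, 1, 0}, bitsPerLevel);

        System.out.println(Long.toBinaryString(beijing));    // 0  (i.e. 000)
        System.out.println(Long.toBinaryString(koreaCity0)); // 10 (i.e. 010)
    }
}
```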

D. Misc Components
Several other components are needed, including visualization tools, exploratory query tools, and an integrated development environment that helps users to organize the dataset, create the model, and perform the analysis. Management and monitoring facilities are also needed.

V. OUR ONGOING ENDEAVORS TO PROVIDE SOME BUILDING BLOCKS FOR THE UNIFIED SYSTEM

Our works concentrate on online analytical processing (OLAP).
A. Hierarchical Encoding of Star Schema Data in HDFS Blocks for Highly Scalable OLAP with Hadoop [20]
1) Motivation
The data in a data warehouse for OLAP applications is usually organized using a star schema, with a fact table surrounded by a number of dimension tables. To answer OLAP queries posed on the data, the fact table must be joined with the dimension tables. When the volume of the star schema data reaches a certain threshold, processing the data on a single node becomes difficult. A scale-out methodology should be adopted, leveraging the computation power of a cluster of commodity computers, not only for performance but also for economic reasons. To facilitate the join between the fact table and the dimension tables, the dimension tables could be replicated to every node in the cluster, so that the join can be done locally on every node. This method, however, incurs significant extra storage cost. For instance, to distribute 5GB of dimension data and 1TB of fact data over one hundred nodes, the dimension data will occupy 500GB of space (5GB * 100), which is almost half of the space consumed by the fact table (the fact table is horizontally partitioned and distributed onto the cluster); the more nodes, the more additional space is consumed. Some works perform an on-the-fly join, including a one-to-many shuffling strategy [21] and broadcast join [22], but these methods may suffer from excessive communication cost. Based on our observation that most star queries perform an aggregation along dimensional hierarchies, we encode the hierarchical information of the dimension tables into the fact table to replace the foreign keys. The fact table is then partitioned and distributed onto a cluster with an arbitrary number of data nodes.

Figure 2. The Hierarchical Encoding

A dimension hierarchical code is the concatenation of the local hierarchy codes of the nodes along the path from the root to the lowest-level node. For example, the code for the "Beijing" node is "000"; the customers belonging to the city of "Beijing" have their own local domain codes, and the dimension hierarchical code for each customer in "Beijing" is constructed using "000" as a prefix. To guarantee a fixed size for each sub bit-string in the final dimension encoding, the largest cardinality among the local domains of a hierarchical level is used for that level. For example, if "America" has 4 members, then the members under "Asia" also use 2 bits to encode their values. A compound dimension code is created by combining the dimension hierarchical codes of all dimensions. The compound dimension code is then used to replace the foreign keys in the fact table. Compared to BLINK [23] and some other universal relation techniques [24], the hierarchical encoding scheme incurs much less space overhead, since only the hierarchical information is encoded into the fact table.
3) The TRM Execution Model


The execution model includes three major steps: query transformation, reduce, and merge. We adopt a shared-nothing system architecture: all dimension tables are stored on the master node, and the fact table is partitioned evenly across the slave nodes (data nodes). After the above-mentioned preprocessing, the fact table contains all the dimension hierarchy information, which is enough for most data-warehouse-style aggregation queries. When a query arrives, the master node rewrites it (Query Transformation) into a new query that operates only on the fact table; the rewritten query is then scattered onto the data nodes (slave nodes). The data nodes execute the query on their local data (Reduce) and transfer the results containing partial aggregations to the master node. Since the fact table already contains the needed hierarchy information, there is no need to join the dimension tables with the fact table during local aggregation on the data nodes, which greatly speeds up query processing. To improve merge performance, the local aggregations are sorted on the Group-By columns before being transferred to the master node. Lastly, the master node merges the collected results, looks up the needed information in the dimension tables, executes the Having clause, sorts the data if necessary, and returns the final results to the users (Merge).
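A hedged, single-process Java sketch of the TRM flow is shown below: each "data node" (here simply a list partition) aggregates locally over the encoded fact table, and the master merges the partial aggregates. Dimension lookups, sorting and the Having step are omitted, and all names, masks and values are illustrative.

```java
import java.util.*;

// A simplified, single-process illustration of the TRM (Transform, Reduce, Merge)
// execution model: data nodes aggregate locally over the encoded fact table, and
// the master merges the partial aggregates.
public class TrmSketch {

    // One fact row: the compound dimension code plus a measure.
    record Fact(long dimCode, double revenue) {}

    // Reduce step on one data node: group by the masked dimension code.
    static Map<Long, Double> localAggregate(List<Fact> partition, long groupByMask) {
        Map<Long, Double> partial = new HashMap<>();
        for (Fact f : partition) {
            partial.merge(f.dimCode() & groupByMask, f.revenue(), Double::sum);
        }
        return partial;
    }

    // Merge step on the master node: combine the partial aggregates.
    static Map<Long, Double> merge(List<Map<Long, Double>> partials) {
        Map<Long, Double> global = new TreeMap<>();
        for (Map<Long, Double> p : partials) {
            p.forEach((k, v) -> global.merge(k, v, Double::sum));
        }
        return global;
    }

    public static void main(String[] args) {
        long regionMask = 0b100;  // assume the highest of three bits encodes the region
        List<Fact> node1 = List.of(new Fact(0b000, 10.0), new Fact(0b101, 5.0));
        List<Fact> node2 = List.of(new Fact(0b011, 7.0), new Fact(0b110, 2.0));

        List<Map<Long, Double>> partials = new ArrayList<>();
        for (List<Fact> partition : List.of(node1, node2)) {
            partials.add(localAggregate(partition, regionMask));
        }
        System.out.println(merge(partials)); // {0=17.0, 4=7.0} : revenue per region code
    }
}
```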

By applying the encoding scheme to HDFS blocks, we expect that the performance of OLAP processing in Hadoop will be boosted, while the nice properties of Hadoop, high scalability and fault tolerance, are almost fully retained. We have implemented the scheme using Hadoop 0.20.2. We generated 500GB of SSB (Star Schema Benchmark) data and distributed it onto a cluster of 14 nodes, one serving as the Name Node and the others as Data Nodes. Each node has 2GB of memory, an Intel Core2 Duo 1.87GHz processor and 300GB of disk space, and runs 32-bit Ubuntu 10.10. We compare the performance of SSB queries on Hadoop, HadoopDB (we could not obtain the Hadapt source code), and our prototype system, Dumbo; the result is shown in the figure. From the figure we can see that Dumbo greatly outperforms Hadoop and HadoopDB. Only the queries that aggregate one measure are listed here; queries that operate on more than one measure achieve a similar performance boost over Hadoop and HadoopDB (please refer to [20]).

Figure 4. SSB Query Performance (Dumbo vs. HadoopDB/Hadoop, Scale Factor = 500)

Figure 3. The TRM Execution Model

The query transformation module relies on several rules to carry out the transformation of equality predicates, range predicates, LIKE predicates, and IN predicates. When transforming an equality predicate, the corresponding local hierarchy code is extracted by a mask, and the bitwise AND of the fact table tuple's code and the mask is compared with a constant. For example, "dcustomer.region='Asia' and dcustomer.country='China' and dcustomer.city='Beijing'" is transformed into "t & ...111... = ...000...", where t denotes the tuple's compound dimension code, the sub bit-string "111" in the mask extracts the region/country/city hierarchy levels from the tuple, and the sub bit-string "000" is the hierarchical code along the customer hierarchy from "Asia" to "China" to "Beijing". Range predicates can also be easily transformed, using a pattern like "tuple & extract_bit_string between low_bound and high_bound". Predicates on different dimensional hierarchies can be evaluated in one pass of data scanning. Readers can refer to the full version of the technical report [20] for details of the transformation rules.
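The following hedged Java sketch shows how the transformed predicates can be evaluated directly on a tuple's compound dimension code: an equality predicate becomes one bitwise AND followed by a comparison, and a range predicate becomes an AND followed by a bounds check. The bit positions and helper names are illustrative, not the exact representation used by Dumbo.

```java
// A sketch of evaluating transformed predicates directly on the encoded key of a
// fact tuple: "t & mask == constant" for equality predicates along a hierarchy,
// and "low <= (t & mask) <= high" for range predicates.
public class PredicateEvalSketch {

    // Equality predicate, e.g. region='Asia' AND country='China' AND city='Beijing'
    // compiled to a mask over the region/country/city bits and the code "000".
    static boolean matchesEquality(long tupleCode, long mask, long constant) {
        return (tupleCode & mask) == constant;
    }

    // Range predicate over one hierarchy, e.g. all cities whose hierarchical code
    // falls between low and high (both already aligned to the masked positions).
    static boolean matchesRange(long tupleCode, long mask, long low, long high) {
        long v = tupleCode & mask;
        return v >= low && v <= high;
    }

    public static void main(String[] args) {
        // Assume the customer hierarchy occupies the three lowest bits of the
        // compound dimension code (1 bit each for region, country, city).
        long mask = 0b111;

        long beijingTuple = 0b101000;  // a fact tuple whose customer code is 000
        long otherTuple   = 0b101011;  // a fact tuple whose customer code is 011

        System.out.println(matchesEquality(beijingTuple, mask, 0b000));   // true
        System.out.println(matchesEquality(otherTuple,   mask, 0b000));   // false
        System.out.println(matchesRange(otherTuple, mask, 0b010, 0b100)); // true
    }
}
```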

Dumbo needs three steps to load the data: first, it generates surrogate keys (the compound dimension codes, which are used to replace the foreign keys in the fact table) and inserts them into the fact table; second, it sorts the data on the surrogate keys; finally, it writes the sorted data to the data nodes. The loading time of Dumbo is between those of Hadoop and HadoopDB. After an investigation we found that the overhead is mainly incurred by the sort operation, which accounts for 54.5% of the total data loading cost; we believe there is room to improve it. As for space overhead, without compression Dumbo consumes 12% more space than Hadoop and 35% more space than HadoopDB.

4) Experiment Results
The TRM execution model fits the Hadoop framework quite well. Based on our early verification of the idea using PostgreSQL [25], we have migrated the scheme to the Hadoop platform.

Figure 5. SSB Data Loading (Scale Factor=500)


Figure 6 illustrates the motivation of this work. It shows two typical scan queries over the lineitem table of the TPC-H benchmark (one accessing four columns and the other accessing sixteen columns), which incur very different execution times under the two storage models. Either of the two models, PAX-Store and Pure Column-Store, can significantly outperform the other in particular circumstances. It is unrealistic to identify the optimal storage model beforehand. There is a need for a data processing system that can adapt its storage model to incoming queries dynamically.

B. Different Layouts for Different Replicas of an HDFS Block for High-Performance OLAP with Hadoop [26]
1) Motivation
The replicas of an HDFS block are mainly used for fault tolerance. In this work, we apply different layouts to different replicas of an HDFS block for higher OLAP processing performance, while maintaining the fault tolerance property of HDFS. MapReduce can work on thousands of computer nodes and handle petabytes of data. During data processing, disk I/O and network transmission are the main factors determining overall performance. Therefore, in recent years, a number of sophisticated storage models from RDBMS have been transplanted to the MapReduce platform [7] [27] to achieve efficient data access. As demonstrated by many studies, a column-store is a favorable storage model, as it outperforms the PAX-Store in data analysis in most cases. PAX-Store is a simple way to implement a column-store: it partitions a relational table horizontally into units, and within each unit the data is further partitioned by columns, so that the values of each column are stored consecutively. This kind of column-store is convenient to implement over Hadoop and is able to achieve high compression ratios. As all the values of each tuple are stored within the same unit, it does not incur additional cost to reconstruct a tuple. Such a construction of column-store was first introduced in [28], known as PAX; some recent MapReduce storage models, such as RCFile [7], have adopted this mechanism. A disadvantage of this approach is that it cannot avoid redundant data access when only a subset of the columns is required: the whole unit has to be retrieved. A Pure Column-Store stores the columns of a table separately, so that each column can be located and processed independently. Using this construction, one can avoid reading unnecessary columns during query execution, and high compression ratios can also be achieved by compressing each column within its own data domain [29] [30]. However, a Pure Column-Store may incur high overhead to reconstruct a data tuple, especially when the fields of the tuple are distributed onto different nodes. To minimize the cost of tuple reconstruction, some recent work [27] has attempted to co-locate the column blocks belonging to the same set of tuples on the same node. Nevertheless, tuple reconstruction remains an expensive process.

2) Different Storage Layouts for Different Replicas
By utilizing HDFS's replication mechanism, we store different replicas of a data block in different storage models (layouts): some replicas are stored in the Pure Column-Store, while the others are stored in the PAX-Store. The technique is named HC-Store. For each map task, we use a cost model to decide which replica to access to achieve the best performance. Such a hybrid system can adapt to different types of workloads automatically; engineers no longer need to predetermine the storage model for their applications. In fact, this storage model can be implemented at the application level, without modifying the existing MapReduce platform. A file in HC-Store has at least two replicas, one in the PAX model and the other in the Pure Column model. By default, we store two-thirds of the replicas of a block in the Pure Column-Store and the rest in the PAX-Store, as the Pure Column-Store outperforms the PAX-Store in more cases. This default ratio is tunable.

Figure 7. HC-Store and the Router (MR: MapReduce)

To implement a Pure Column-Store on MapReduce, it is not sufficient to simply store each column as a separate file; some more sophisticated mechanisms have to be considered. As HDFS cannot guarantee that the fields of the same tuple are located in the same data node, extra network transmission cost is incurred to reconstruct a tuple. Although the authors of [27] have designed a column-store on MapReduce that stores the records of different columns on the same node, their implementation is not sufficient for HC-Store. For instance, we need to ensure that a Pure Column-Store replica and a PAX-Store replica are not located on the same node; otherwise, the fault-tolerance ability of MapReduce will be impaired.

Figure 6. Performance of PAX-Store and Pure Column-Store


Moreover, we need to store the metadata that maps the columns between the different storage models. For each incoming MapReduce job, the system estimates the access cost of the two storage models using statistics collected during the data loading phase, and then chooses the model that results in the optimal performance. The map tasks are thus assigned to the corresponding data splits.
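The actual cost equations are given in the technical report [26]; the Java sketch below only illustrates the routing decision, with simplified placeholder cost terms (fraction of the block read, a tuple-reconstruction overhead, and a load-balance penalty) standing in for the real model.

```java
// A hedged illustration of HC-Store-style replica routing: for each map task we
// estimate a cost for the PAX replica and for the Pure Column replica, then route
// the task to the cheaper one. The cost terms below are simplified placeholders,
// not the actual equations from the HC-Store technical report.
public class ReplicaRouterSketch {

    enum Layout { PAX, PURE_COLUMN }

    // Simplified per-task cost: bytes read + tuple reconstruction + load penalty.
    static double estimateCost(Layout layout, long blockBytes, int columnsNeeded,
                               int totalColumns, int tasksAlreadyOnLayout) {
        double fractionRead = (layout == Layout.PURE_COLUMN)
                ? (double) columnsNeeded / totalColumns   // reads only the needed columns
                : 1.0;                                    // PAX reads the whole row group
        double reconstruction = (layout == Layout.PURE_COLUMN)
                ? 0.10 * columnsNeeded                    // assumed tuple-reconstruction overhead
                : 0.0;
        double loadPenalty = 0.05 * tasksAlreadyOnLayout; // crude load-balance term
        return blockBytes * fractionRead * (1.0 + reconstruction) + blockBytes * loadPenalty;
    }

    static Layout route(long blockBytes, int columnsNeeded, int totalColumns,
                        int tasksOnPax, int tasksOnColumn) {
        double paxCost = estimateCost(Layout.PAX, blockBytes, columnsNeeded, totalColumns, tasksOnPax);
        double colCost = estimateCost(Layout.PURE_COLUMN, blockBytes, columnsNeeded, totalColumns, tasksOnColumn);
        return paxCost <= colCost ? Layout.PAX : Layout.PURE_COLUMN;
    }

    public static void main(String[] args) {
        long block = 64L * 1024 * 1024;
        // A narrow scan (4 of 16 columns) should favor the Pure Column replica,
        // a wide scan (14 of 16 columns) should favor the PAX replica.
        System.out.println(route(block, 4, 16, 0, 0));
        System.out.println(route(block, 14, 16, 0, 0));
    }
}
```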

(4) Since load balance is an important factor for the overall performance of Hadoop, our cost model considers not only the cost of each MapReduce job independently, but also the workload distribution in the cluster. To estimate the workload of a storage model, we take all the running tasks on this model and its replication factor into consideration. (5) Summing up the above-mentioned components, we obtain the final global cost estimation equation for the two storage models. For more details of the global cost estimation (equations and descriptions) as well as the local cost estimation, please refer to [26].
• Local cost estimation
The situation changes when a data failure or task failure occurs. To re-execute a task, the data access has to involve network transmission. For large clusters, there are two types of network access paths: inter-rack access and intra-rack access; in most cases, intra-rack access is much cheaper than inter-rack access. We remove the job startup cost and replace the overall I/O rate R in the global cost estimation equation with a pure network transmission rate Rnw, because local access becomes a rare case in this circumstance. In the local cost estimation, the load balancing problem is not considered.

3) The Cost Model
Compared to traditional cost models adopted in RDBMS, our cost model is relatively simple. However, it has to take into account a number of new issues imposed by MapReduce. First, the computing environment of MapReduce in a cluster changes dynamically, as the computing nodes can be unreliable and the network topology can be complex. As such, when a map task fails and needs to be re-executed on a different node, the new node may have to access the data remotely; in this case, an intra-rack access is usually cheaper than an inter-rack access, as the network transmission becomes a more decisive factor than the storage model. Second, maintaining load balance during the execution of multiple MapReduce jobs becomes an important issue: if most tasks are routed to the blocks in one particular storage model, they may overload the nodes storing those blocks. Therefore, even though some tasks fit one storage model better, it may be wiser to route them to the other storage model for the sake of load balancing. Cost estimation is divided into two parts: the global cost estimation and the local cost estimation.
• Global cost estimation
The global cost estimation is adopted when a MapReduce job is submitted. In this phase, most mappers process data locally using the MapReduce mechanism. Our cost model estimates the average I/O cost of each map task and the workload on each storage model, and then calculates the global cost to choose the access path that is optimal for most mappers.

Figure 8. Building up the Cost Model for Query Routing: (1) full scan or index scan; (2) data access cost (local I/O / network cost); (3) cost for Pure Column-Store / PAX-Store; (4) decompression cost; (5) load balance to avoid overloaded nodes; combined into the final cost estimation for query routing

4) Experiment Results
We have implemented a prototype of HC-Store within HDFS of Hadoop 0.21.0. We deployed Hadoop 0.21.0 and Hive 0.7.1 on a cluster of 9 computing nodes, one serving as the Name Node and the others as Data Nodes; these nodes are a subset of the nodes described in Section V.A. We loaded a 300GB TPC-H dataset both with and without using the GZip algorithm to compress the data on-the-fly. In all storage models, we maintained two replicas of each data block. The loading times are shown in Figure 9. For compressed data, HC-Store and the PAX-Store are slower than the Row-Store and the Pure Column-Store. This is because HC-Store wrote the data in two formats without utilizing pipeline parallelism, which introduced more random I/O cost.

Figure 9. Comparison of Data Loading Time

The procedure of global cost estimation is as follows: (1) We decide on a full sequential scan or a jump scan using an ancillary index, according to the selectivity of the query. (2) The data transmission rate is calculated according to the probability of a replica being local to a mapper, intra-rack, or inter-rack. (3) We then model the cost of accessing the Pure Column-Store and the PAX-Store based on (1) and (2); for compressed data, the additional decompression cost is also taken into consideration.

The storage space consumed by the models is shown in Figure 10. When the TPC-H data were uncompressed, all the column-oriented stores occupied a little more space than the row store. This is because all the column-oriented stores need to keep extra metadata for tuple reconstruction. However, for compressed TPC-H data, the column-oriented stores were more space-efficient than the row store, as they achieved better compression ratios.


In principle, HC-Store can outperform the PAX-Store and the Pure Column-Store in efficiency, as it can choose the optimal data access strategy for each individual task.

Figure 10. Comparison of Storage Space Consumed

• Performance of Aggregation
We ran TPC-H Q6 on the lineitem table and varied the number of columns used by Q6. When the percentage of the columns involved in the query was less than 60%, the Pure Column-Store exhibited the best performance, because it avoids accessing a significant amount of data that is not required by the query. Otherwise, the PAX-Store outperformed the others, as tuple reconstruction is less costly in the PAX-Store. In all cases, HC-Store achieved near-optimal performance by routing each map task to the optimal data store.

Figure 12. Join Performance of Different Storage Models

Our work builds on several previous works and differs from them. Trojan layouts [31] organize different replicas of a block into different column groups based on the input workload pattern; because the approach depends on the workload to determine the layout of each block, it is more applicable to areas where workloads change little. HC-Store is more flexible in this respect: it does not need to predetermine the workload. The Pure Column-Store has been around for decades [28]; our work builds on the technique, and HC-Store differs in that it targets distributed file systems built on unreliable clusters. Our idea of HC-Store was also inspired by several earlier works [32] [33], notably fractured mirrors [32], which store data in both a row-store and a column-store on RAID. Unlike that work, we store data on Hadoop's distributed file system, where nodes may be less reliable, so issues such as data recovery and column co-location have to be considered. Our work resembles the Las Vegas project of Brown University in that both leverage replicas to speed up MapReduce processing: in [34] a replica can choose its own partitioning and sorting methods in order to speed up different sets of queries, whereas our work uses a mix of PAX-Store and Pure Column-Store for high-performance OLAP processing, with a cost model developed for query routing.


VI. CONCLUSION
When we talk about data analytics, MapReduce and RDBMS serve the same purpose, and MapReduce can do even more. An ecosystem built around RDBMS for data analytics has existed for decades; another ecosystem surrounding MapReduce is emerging. MapReduce and RDBMS are complementary to each other. We envision that a unified system will come into being and take over the task of big data analytics in the not too distant future. The system will feature SQL analysis, data mining, machine learning, and other complex analytic algorithms over large volumes of structured and unstructured data, in a (near) real-time manner with the help of in-memory processing, delivering valuable insights on a single platform. The goal cannot be achieved overnight; some interesting issues remain open for people to bring forward effective and efficient techniques.

Figure 11. Aggregation Performance of Different Storage Models ((a) uncompressed data; (b) compressed data)

• Performance of Join
We used TPC-H Q10 to evaluate the join performance of the different storage models. Q10 joins the customer table (accessing seven-eighths of its columns), the lineitem table (accessing one-fourth of its columns) and the orders table (accessing one-third of its columns). HC-Store can identify the best data access path for each table: in the query execution, the lineitem and orders tables were accessed through the Pure Column-Store, while the customer table was accessed through the PAX-Store. Thus HC-Store outperformed the others.


VII. ACKNOWLEDGEMENTS & BACKGROUND INFORMATION
Thanks for funding support from the Important National Science & Technology Specific Projects of China under Grant No. 2010ZX01042-001-002 and Grant No. 2010ZX01042-002-002-03, and the NSF of China under Grant No. 61170013. This research work is also part of a joint research project with EMC Labs China, which is funded by the EMC Global CTO Office. Professor Sa Shi-Xuan (1922-2010) was a founder and leader of database education and research in China; the Sino-Australia joint Research Center for Big Data Management and Analytics is named after him for his distinguished contribution. The research center is led by Professor Xiaofang ZHOU from the University of Queensland, Australia. As of 2012, Huiju WANG and Furong LI are, respectively, a post-doctoral researcher and a PhD candidate in the Computer Science Department of the National University of Singapore.

REFERENCES
[1] Brian Hopkins. Blogging From the IBM Big Data Symposium - Big Is More Than Just Big. http://blogs.forrester.com/brian_hopkins/1105-13blogging_from_the_ibm_big_data_symposium_big_is_more_than_just_big. 2011.
[2] Marko A. Rodriguez. MySQL vs. Neo4J on a Large-Scale Graph Traversal. http://markorodriguez.com/2011/02/18/mysql-vs-neo4j-on-a-large-scale-graph-traversal/. 2011.
[3] Jeffrey Dean, Sanjay Ghemawat. MapReduce: simplified data processing on large clusters. OSDI 2004, pp.137-150.
[4] Kyong-Ha Lee, Yoon-Joon Lee, Hyunsik Choi, Yon Dohn Chung, Bongki Moon. Parallel data processing with MapReduce: a survey. SIGMOD Record, 2011, 40(4):11-20.
[5] Jimmy Lin, Michael Schatz. Design patterns for efficient graph algorithms in MapReduce. Eighth Workshop on Mining and Learning with Graphs 2010, pp.78-85.
[6] Tim Kaldewey, Eugene J. Shekita, Sandeep Tata. Clydesdale: Structured Data Processing on MapReduce. EDBT 2012, pp.15-25.
[7] Yongqiang He, Rubao Lee, Yin Huai, Zheng Shao, Namit Jain, Xiaodong Zhang, Zhiwei Xu. RCFile: A Fast and Space-efficient Data Placement Structure in MapReduce-based Warehouse Systems. ICDE 2011, pp.1199-1208.
[8] Natalie Gruska, Patrick Martin. Integrating MapReduce and RDBMSs. CASCON 2010, pp.212-223.
[9] Michael Stonebraker, Daniel Abadi, David J. DeWitt, Sam Madden, Erik Paulson, Andrew Pavlo, Alexander Rasin. MapReduce and parallel DBMSs: friends or foes? Communications of the ACM, 2010, 53(1):64-71.
[10] Eric Friedman, Peter Pawlowski, John Cieslewicz. SQL/MapReduce: A practical approach to self-describing, polymorphic, and parallelizable user-defined functions. PVLDB 2009, 2(2):1402-1413.
[11] Jeffrey Cohen, Brian Dolan, Mark Dunlap, Joseph M. Hellerstein, Caleb Welton. MAD skills: new analysis practices for big data. PVLDB, 2009, 2(2):1481-1492.
[12] Azza Abouzeid, Kamil Bajda-Pawlikowski, Daniel Abadi, Avi Silberschatz, Alexander Rasin. HadoopDB: an architectural hybrid of MapReduce and DBMS technologies for analytical workloads. PVLDB 2009, 2(1):922-933.
[13] Azza Abouzied, Kamil Bajda-Pawlikowski, Jiewen Huang, Daniel J. Abadi, Avi Silberschatz. HadoopDB in Action: Building Real World Applications. SIGMOD 2010, pp.1111-1114.
[14] Kamil Bajda-Pawlikowski, Daniel J. Abadi, Avi Silberschatz, Erik Paulson. Efficient Processing of Data Warehousing Queries in a Split Execution Environment. SIGMOD 2011, pp.1165-1176.
[15] Hortonworks to Deliver Next-Generation of Apache Hadoop. http://www.businesswire.com/news/home/20120119005825/en/Hortonworks-Deliver-Next-Generation-Apache-Hadoop. 2012.
[16] Byung-Kwon Park, Il-Yeol Song. Toward total business intelligence incorporating structured and unstructured data. International Workshop on Business Intelligence and the Web 2011, pp.12-19.
[17] Ashish Thusoo, Joydeep Sen Sarma, Namit Jain, Zheng Shao, Prasad Chakka, Suresh Anthony, Hao Liu, Pete Wyckoff, Raghotham Murthy. Hive: A Warehousing Solution Over a MapReduce Framework. PVLDB, 2009, 2(2):938-941.
[18] Sudipto Das, Yannis Sismanis, Kevin S. Beyer, Rainer Gemulla, Peter J. Haas, John McPherson. Ricardo: Integrating R and Hadoop. SIGMOD 2010, pp.987-998.
[19] Cheng T. Chu, Sang K. Kim, Yi A. Lin, YuanYuan Yu, Gary Bradski, Andrew Y. Ng, Kunle Olukotun. Map-Reduce for Machine Learning on Multicore. NIPS 2006, pp.281-288.
[20] Huiju Wang, Shan Wang, Xiongpai Qin, Furong Li, Xuan Zhou, Zuoyan Qin, Qing Zhu. Efficient Star Query Processing on Hadoop – A Hierarchy Encoding based Approach. Technical Report HiDB-2012-003, High Performance Database Lab, Information School, Renmin University of China. 2012.
[21] David Jiang, Anthony K. H. Tung, Gang Chen. Map-join-reduce: Towards scalable and efficient data analysis on large clusters. TKDE 2011, 23(9):1299-1311.
[22] Spyros Blanas, Jignesh M. Patel, Vuk Ercegovac, Jun Rao, Eugene J. Shekita, Yuanyuan Tian. A Comparison of Join Algorithms for Log Processing in MapReduce. SIGMOD 2010, pp.975-986.
[23] Knut Stolze, Vijayshankar Raman, Richard Sidle, O. Draese. Bringing BLINK Closer to the Full Power of SQL. BTW 2009, pp.157-166.
[24] Tzy-Hey Chang, Edward Sciore. A Universal Relation Data Model with Semantic Abstractions. IEEE Transactions on Knowledge and Data Engineering, 1992, 4(1):23-33.
[25] Huiju Wang, Xiongpai Qin, Yansong Zhang, Shan Wang, Zhanwei Wan. LinearDB: A Relational Approach to Make Data Warehouse Scale like MapReduce. Proceedings of DASFAA 2011, pp.306-320.
[26] Huiju Wang, Furong Li, Xuan Zhou, Yu Cao, Xiongpai Qin, Jidong Chen, Shan Wang. HC-Store: Putting MapReduce's Foot in Two Camps. Technical Report HiDB-2012-007, High Performance Database Lab, Information School, Renmin University of China. 2012.
[27] Floratou A., Patel J.M., Shekita E.J., Tata S. Column-oriented storage techniques for MapReduce. PVLDB 2011, 4(7):419-429.
[28] Copeland G.P., Khoshafian S.N. A decomposition storage model. SIGMOD 1985, pp.268-279.
[29] Abadi D.J., Madden S., Hachem N. Column-stores vs. row-stores: how different are they really? SIGMOD 2008, pp.967-980.
[30] Stonebraker M., Abadi D.J., Batkin A., Chen X., Cherniack M., Ferreira M., Lau E., Lin A., Madden S., O'Neil E., O'Neil P., Rasin A., Tran N., Zdonik S. C-store: a column-oriented DBMS. VLDB 2005, pp.553-564.
[31] Jindal A., Quiane-Ruiz J.A., Dittrich J. Trojan data layouts: right shoes for a running elephant. SOCC 2011, pp.1-14.
[32] Ramamurthy R., DeWitt D.J., Su Q. A case for fractured mirrors. VLDB Journal, 2003, 12(2):89-101.
[33] Sunita Sarawagi, Michael Stonebraker. Efficient Organization of Large Multidimensional Arrays. Proceedings of ICDE 1994, pp.328-336.
[34] Data Management Research Group, Brown University. Las Vegas: Using Replication to Speed-up MapReduce. http://database.cs.brown.edu/projects/las-vegas/. 2012.
