Join Optimization in the MapReduce Environment for Column-wise Data Store
Minqi Zhou†, Rong Zhang§, Dadan Zeng†, Weining Qian†, Aoying Zhou†
† Software Engineering Institute, East China Normal University, Shanghai 200062, China.
§ National Institute of Information and Communications Technology, Kyoto 619-0289, Japan.
{mqzhou,ddzeng,wnqian,ayzhou}@sei.ecnu.edu.cn, [email protected]
Abstract— Chain join processing, which combines records from two or more tables sequentially, has been well studied in centralized databases. However, it has seldom been discussed in the cloud computing era, and remains an imperative problem, especially where structured (relational) data are stored in a column-wise (attribute-wise) fashion in distributed file systems (e.g., the Google File System) over hundreds or even thousands of commodity PCs. In this paper, we propose a novel method for chain join processing, which is one of the common primitives for column-wise stored data analysis in the cloud era. By effectively selecting the dedicated records (tuples) for the chain join based on the information exploited within a bipartite join graph, the communication cost for record transmission can be reduced dramatically. A bushy tree structure is deployed to regulate the chain join sequence, which further reduces the number of intermediate results generated and transmitted and exploits higher parallelism, resulting in more efficient join processing. Our extensive performance study confirms the effectiveness and efficiency of our methods.
I. INTRODUCTION

The notion of cloud computing has become a well-known buzzword nowadays due to the success of a number of companies, such as Google, Amazon, Salesforce, and Facebook, in effectively and efficiently managing tremendously large volumes of data. Herein, cloud computing refers both to the applications delivered as services over the Internet and to the hardware and system software in the data centers that host those services [1]. The revolution currently underway in cloud computing is to build systems from tens of thousands of commodity PCs or servers, instead of traditional mainframes, to provide Internet-scale services and to achieve utility computing [2] as the ultimate goal. In such systems, scalability is the mantra. Google's MapReduce [3] framework for parallel computing achieves nearly linear scalability (i.e., adding resources to a system increases its performance in proportion to the resources added). To acquire such scalability, applications are often compelled to replicate information at several sites and manage it efficiently; these are the main functionalities of the Google File System (GFS) [4], which sets 3 as the default replication factor (i.e., data are replicated at 3 different sites). Consequently, such systems enhance their fault tolerance, availability, and performance, which is the very reason why they are able to provide services with high availability and good user experiences to a large number of users.
In this contemporary information exploration era, data are generated at an amazing pace. IDC estimates that 161 exabytes of data were generated worldwide in 2007, and that 988 exabytes (closing in on 1 zettabyte) will be generated in 2010 [5]. Effective and efficient analysis of such tremendously large volumes of data, together with knowledge extraction for business intelligence and resource planning, thus becomes a promising direction. In such circumstances, historical data along with data from multiple operational databases are involved; they are loaded into the system in bulk once but accessed multiple times for diverse data analysis purposes. That is to say, such data analysis tends to be read-mostly, or even read-only. The column data store (C-store for short) [6] is optimized for exactly such application scenarios. As its name suggests, a column store keeps structured (relational) data in a column-wise (attribute-wise) format, in contrast to a traditional row store, which keeps structured data in a row-wise (record-wise) format. Compared with traditional row-store strategies, a column store is able to trade CPU cycles for disk bandwidth because of its natural advantages in data compression and single-column (attribute) selection, especially for analysis (read-mostly) tasks. Intuitively, hosting column-store data in a distributed file system (e.g., GFS) and analyzing it under a parallel processing framework (e.g., MapReduce) meets the requirements of contemporary data analysis (e.g., decision making, resource planning). In the data model given in C-store [6], tables are projected into a set of tables with fewer columns, denoted projections, which are stored as files in the distributed file system. In this case, heavy relational joins among those projections cannot be avoided during data analysis. Hence, efficient and effective data analysis relies on this kind of join operator, of which the chain join [7] is the main style. Herein, chain join processing refers to combining records from two or more projections sequentially. In this paper, we address the problem of efficient chain join processing in the MapReduce environment for structured (relational) data stored in a column-wise format. It can be formally defined as follows: given a large volume of structured data (e.g., several terabytes or even petabytes in capacity) which are projected into a set of projections with fewer columns, each of which is stored in a column-wise (attribute-wise) format in a distributed file system (e.g., GFS [4], HDFS [8]) over hundreds or even thousands of
shared-nothing commodity PCs, the task is to provide an efficient chain join primitive over those separate projections in the MapReduce framework. We mainly focus on exploiting the parallelism of the system and reducing the network bandwidth usage of the join processing. The main contributions of this paper include:
∙ A bipartite graph, which is naturally compatible with the MapReduce parallel processing framework, is deployed to optimize data transmission in chain join processing. The bipartite graph is capable of selecting the dedicated tuples to be transmitted to their destinations, resulting in a dramatic communication cost reduction compared with existing work. Hence, we name this technique bipartite join selection.
∙ The chain join sequence is regulated by a bushy tree, built on the obtained bipartite join selection information, to further reduce the intermediate join results that need to be transmitted to the next join step in the chain and to further exploit the parallelism of chain join processing.
∙ We conduct an extensive performance study which shows the effectiveness and efficiency of our chain join primitive in the MapReduce environment over hundreds of commodity PCs.
The rest of the paper is organized as follows. In Section II, we review related work; the problem statement follows in Section III. Bipartite-graph-based selection of the dedicated tuples in the projections is given in Section IV, followed by chain join sequence regulation in Section V. An extensive experimental study is presented in Section VI. We conclude the paper in Section VII.

II. RELATED WORKS

In this section, we review the work related to this paper, including column stores and join processing in parallel systems.

A. Column Store

Data analysis applications are prospering nowadays, especially in this information exploration age. The common property of these data analysis applications (e.g., data warehouse applications, customer relationship management, electronic library card catalogs) is that they are read-mostly. The column store was originally designed for such data analysis purposes and has already been implemented in many commercial data warehouse systems, such as Sybase IQ [9], [10], Addamark [11], and KDB [12]. Recently, Stonebraker et al. [6] provided a column-oriented DBMS prototype which was further developed into a commercial product (VERTICA [13]). Extensive comparisons between row stores and column stores are given in [14], and the column store is a promising candidate for the future of data analysis applications.

B. Join Processing

Analytical data management is rather prosperous in the cloud computing era, and join processing is one of its basic primitives. Much work has been done in the parallel database literature and even in cloud data management systems.
Sort-merge join [15], simple hash join [16], Grace hash join [17], and hybrid hash join [16] were designed and implemented for parallel database systems to make full use of main memory in join processing. Their main differences lie in the strategies used to load data into memory. These algorithms deliver the expected performance in systems with limited main memory, but their performance degrades in the cloud computing setting. They have been redesigned and reimplemented in the MapReduce framework with an additional Merge phase [18] to achieve higher parallel performance, but they do not take the reduction of network bandwidth usage (communication cost) into account, which has become one of the main cost measures in the cloud computing model [19]. Afrati and Ullman [7] proposed a join algorithm in the MapReduce framework that joins several relations at once, which is the work most relevant to ours. Their algorithm uses the Map phase to send the records participating in the join to a set of Reduce tasks in order to join several relations at once, but it suffers from high communication cost, especially when the number of relations to be joined is large (e.g., in a chain join). In this paper we propose a new algorithm for chain joins in the MapReduce framework, which leverages a bipartite graph to reduce the communication cost and regulates the chain join sequence to exploit higher parallelism.

III. PROBLEM STATEMENT

In this section, we state the problem to be solved in this paper, namely the efficient implementation of the chain join primitive. The data model used in our paper is consistent with that of C-store [6], which also supports the standard relational logical data model. To keep the paper self-contained, we briefly describe the data model below. As in other relational systems, the columns (attributes) stored in a table can form a unique primary key or a foreign key that references the primary key of another table. We implement projections, denoted P_i, each of which is anchored on a given logical table, denoted the anchored table T. Each projection contains the columns (attributes) of interest projected from T, and retains the same duplicates as the anchored table T. Fig. 1 gives an example showing the original anchored tables and some of their projections, defined according to the columns (attributes) of interest. Take the standard EMP(name, age, salary, dept) and DEPT(dname, floor) relations shown in Fig. 1 as an example. One possible set of projections for these tables is EMP1(name, age), EMP2(name, salary, DEPT.floor), EMP3(dept, age), and DEPT1(dname, floor); projection EMP2(name, salary, DEPT.floor) is shown in Fig. 1. Note that the tuples of a projection are stored column-wise. For example, the tuples of projection EMP2(name, salary, DEPT.floor) in Fig. 1 are stored as Bob, Bill, Jill, Tom; 10K, 30K, 30K, 80K; 1, 2, 1, 3. The projections of the anchored tables are further stored as files in the distributed file system over hundreds or even thousands of commodity PCs.
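To make the column-wise layout concrete, the following is a minimal sketch (ours, not the paper's implementation) of how a projection's row tuples could be rewritten into per-column value lists; the helper function and variable names are our own.

```python
# Illustrative sketch only: store a projection column-wise, one value list per attribute.
def to_column_store(rows, attributes):
    """rows: list of tuples in attribute order; returns {attribute: [values]}."""
    columns = {a: [] for a in attributes}
    for row in rows:
        for attr, value in zip(attributes, row):
            columns[attr].append(value)
    return columns

# The EMP2(name, salary, DEPT.floor) projection from Fig. 1.
emp2_rows = [("Bob", "10K", 1), ("Bill", "30K", 2), ("Jill", "30K", 1), ("Tom", "80K", 3)]
emp2_cols = to_column_store(emp2_rows, ["name", "salary", "DEPT.floor"])
# emp2_cols["name"]   -> ['Bob', 'Bill', 'Jill', 'Tom']
# emp2_cols["salary"] -> ['10K', '30K', '30K', '80K']
```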
Fig. 1. Anchored Table and its Projections.
  EMP Anchored Table (Name, Age, Dept, Salary): (Bob, 25, Math, 10K), (Bill, 25, CS, 30K), (Jill, 27, Math, 30K), (Tom, 27, Bio, 80K).
  DEPT Anchored Table (DName, Floor): (Math, 1), (CS, 2), (Bio, 3), (Chemistry, 4).
  Projection EMP2(Name, Salary, DEPT.Floor): (Bob, 10K, 1), (Bill, 30K, 2), (Jill, 30K, 1), (Tom, 80K, 3).

In our system, Hadoop [20] is deployed both as the data store manager and as the application processing coordinator; it is an open-source framework developed at Yahoo!, modeled after Google's MapReduce, and has been deployed in companies worldwide. For these two purposes, Hadoop provides two corresponding components, the Hadoop Distributed File System (HDFS for short) [8] and Hadoop MapReduce [21], respectively. HDFS has a master/slave structure: it stores meta-data (file name, file path, file replications, and so on) on the master node, denoted the name node, and payload data on the slave nodes, denoted data nodes. Hence, the projections of the anchored tables are stored over the data nodes. Hadoop MapReduce, whose idea originates in functional programming, has proven to offer excellent parallelism and scalability by using the Map function to remove collaboration between nodes and the Reduce function to gather the intermediate results together. Consequently, Hadoop MapReduce is employed as the join processing coordinator in our system.

As relations are stored in projections that have fewer columns (attributes) than their anchored tables, join processing, or more precisely chain join processing, becomes the main primitive for any analysis application over the relational data. We define the chain join as follows.

Definition 1 (Chain Join): A chain join is a join of the form

  JT = P_1(A_1, A_2) ⋈ P_2(A_2, A_3) ⋈ ... ⋈ P_i(A_i, A_{i+1}) ⋈ ... ⋈ P_n(A_n, A_{n+1})    (1)

where JT is the result table of the chain join, A_i is the i-th attribute name, and P_i is the i-th projection of the anchored tables.

Selection, which filters out the proper tuples, is an important operator assisting data analysis applications, and the chain join operator supported in this paper should include this tuple selection (filtering) capability. Therefore, the main task of this paper (i.e., the selection-enabled chain join) can be defined by the following relational algebra:

  JT = σ_1(P_1(A_1, A_2)) ⋈ σ_2(P_2(A_2, A_3)) ⋈ ... ⋈ σ_i(P_i(A_i, A_{i+1})) ⋈ ... ⋈ σ_n(P_n(A_n, A_{n+1}))    (2)

The relational data of the projections are distributed over hundreds or even thousands of commodity PCs managed by distributed file systems (e.g., HDFS). That is to say, these PCs have both the capability and the responsibility to store the data and to process the chain join cooperatively.
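As an illustration of equations (1) and (2), the following sketch (ours, not from the paper) evaluates a selection-enabled chain join as a sequence of pairwise joins on the shared attribute; the projection contents and selection predicates are made up for the example.

```python
# Illustrative sketch of a selection-enabled chain join JT = σ1(P1) ⋈ σ2(P2) ⋈ σ3(P3).
def pairwise_join(left, right, key_idx_left, key_idx_right):
    """Join two tuple lists on one attribute position."""
    return [l + tuple(v for i, v in enumerate(r) if i != key_idx_right)
            for l in left for r in right if l[key_idx_left] == r[key_idx_right]]

# Hypothetical projections P1(A1, A2), P2(A2, A3), P3(A3, A4) and selections.
P1 = [(11, 21), (17, 27), (26, 36)]
P2 = [(21, 62), (27, 78), (36, 81)]
P3 = [(62, 7), (81, 9)]
sel1 = [t for t in P1 if t[0] < 20]   # σ1: A1 < 20
sel2 = list(P2)                       # σ2: select all
sel3 = [t for t in P3 if t[1] > 5]    # σ3: A4 > 5

step1 = pairwise_join(sel1, sel2, key_idx_left=1, key_idx_right=0)  # join on A2
JT = pairwise_join(step1, sel3, key_idx_left=2, key_idx_right=0)    # join on A3
# JT -> [(11, 21, 62, 7)]
```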
The main priorities when handling the chain join in such an environment are to reduce the network bandwidth usage for tuple transmission and to increase the parallelism of chain join processing. We propose a novel method that uses a bipartite graph to select the dedicated tuples for transmission, which reduces the bandwidth cost dramatically, and uses a bushy tree to regulate the chain join sequence, which increases the processing parallelism. We present the two strategies in Section IV and Section V, respectively.

IV. BIPARTITE JOIN SELECTION

As described in Section III, the projections stored in the distributed file system (e.g., HDFS) are spread over hundreds or even thousands of commodity PCs. Therefore, the tuples of the projections participating in the chain join are, with high probability, scattered across these nodes, making tuple transmission inevitable. Reducing the network bandwidth usage by selecting only the data tuples that must be sent and sending them to their dedicated destinations is the key to efficient and effective chain join processing. A bipartite graph connects two disjoint sets (here, projections) together, so it is a natural choice for emulating the join processing. Moreover, the chain join can be represented by a set of bipartite graphs connected with each other end to end. The connections between the bipartite graphs constructed for the chain join further indicate in detail the tuples to be transmitted and their destinations. Next, we describe this tuple selection algorithm in detail.

A. Bipartite Join Selection Principle

The entire tuple selection algorithm can be divided into two phases, denoted bipartite join selection construction and bipartite join selection reduction, which are illustrated in Fig. 2 and Fig. 3, respectively. In principle, the bipartite join selection construction phase emulates the chain join with a chain of bipartite graphs. Each disjoint set within the bipartite graph chain corresponds to a projection in the chain join, while each element within a disjoint set is a tuple of the corresponding projection. More importantly, the connections, which connect elements of two adjacent disjoint sets in the bipartite graph chain, represent the tuple joins within the chain join. The construction of the bipartite graph proceeds briefly as follows.
First, tuple selection is executed individually on each projection P_i according to its corresponding selection function σ_i. Second, the information about the selected data tuples of one projection and their values on the join column (attribute), in a compressed format (e.g., a bitvector), is declared to the next projection in the chain join sequence. Both sequential declaration (i.e., one by one) and concurrent declaration (i.e., in a parallel fashion) are supported. Each projection can compute its connections after it receives the information transmitted from its preceding projection in the chain. We denote this phase bipartite join selection construction. Thus, each projection gets the join information from its preceding projection, and by matching the values on the joining attributes, the connections between the two projections are formed. However, this graph contains many redundant connections, which will be removed in the bipartite join selection reduction phase.

Example 1: As shown in Fig. 2, each rectangle represents the selected data tuples of a projection participating in the chain join, containing two or more columns (attributes): one (usually the first column of the rectangle) for the tuple ids of the projection, and the others for the attributes on which the join is taken. For example, the data tuples with ids {1, 3, 4, 5, 7} in P_1 are selected under function σ_1, and they have the values {11, 17, 26, 26, 35} on A_1, respectively. For the declaration, P_2 is notified of the information about the selected tuples in P_1, which is in a compressed format (i.e., tuple ids as a bitvector and values on A_1 as a list). Similarly, the information about the tuples selected under function σ_2 in P_2 is declared to P_3 in the same manner, as indicated in Fig. 2.

The bipartite join selection construction phase is followed by the bipartite join selection reduction phase, which reduces the number of connections in the bipartite graph chain and consequently results in a dramatic reduction in the number of tuples that must be transmitted. It is accomplished in the inverse order of the information declaration of the construction phase, as shown in Fig. 3. The attribute values that match on the joining attribute A_i, computed at a projection (e.g., P_i), are declared to its preceding projection (i.e., P_{i-1}) in list format. The declaration proceeds recursively in a sequential manner. Eventually, a bipartite graph with reduced connections is formed, which indicates the dedicated tuples of each projection that participate in the chain join. Hence, the optimal data transmission (i.e., minimum bandwidth usage) can be achieved by selecting these tuples and transmitting them to their destinations.

Example 2: Take the bipartite join selection reduction shown in Fig. 3 as an example. P_3 computes the values matched on A_2, based on the information notified in the construction phase (i.e., {21, 27, 36, 51}) and its own values on A_2, to be {21, 51}. P_2 gets the notification (i.e., {21, 51}) and computes the participating tuples (i.e., {1, 5}) and the corresponding values on A_1 (i.e., {11, 35}). The same computation is recursively executed at the preceding sites. Eventually, P_1 determines that only tuple {1} participates in the chain join. Based on the bipartite join selection principle, the total number of tuples transmitted in the chain join is dramatically reduced.
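The following is a minimal sketch, under our own simplifying assumptions (each projection held as a list of (tuple id, left join value, right join value) triples with selections already applied, and declarations modeled as plain Python sets), of the forward construction and backward reduction passes described above; it illustrates the principle, not the paper's MapReduce implementation.

```python
# Sketch of bipartite join selection: forward declaration of join values,
# then backward reduction to keep only tuples that can contribute to the chain join.
def bipartite_join_selection(projections):
    """projections[i] is a list of (tid, left_value, right_value) for P_{i+1}."""
    n = len(projections)
    # Construction phase: each projection keeps only tuples whose left join value
    # appears in the value list declared by its preceding projection.
    survivors = [list(projections[0])]
    for i in range(1, n):
        declared = {right for (_, _, right) in survivors[i - 1]}
        survivors.append([t for t in projections[i] if t[1] in declared])
    # Reduction phase: matched values are declared backwards along the chain.
    for i in range(n - 1, 0, -1):
        matched = {t[1] for t in survivors[i]}
        survivors[i - 1] = [t for t in survivors[i - 1] if t[2] in matched]
    return survivors  # the dedicated tuples of each projection
```

Only tuples that survive both passes need to be shipped to the join sites, which is where the communication savings come from.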
In the example shown in Fig. 2 and Fig. 3, the total number of tuples that need to be transmitted is reduced from 9 to 3. Intuitively, much of the communication cost can be saved. Given the bipartite join selection principle, we now investigate its efficient implementation in the MapReduce framework, which is presented in Section IV-B.

B. Bipartite Join Selection Implementation

Based on the basic idea of bipartite join selection, we investigate its implementation in the MapReduce framework in this section. Two main issues must be considered in the implementation, namely distributed storage and parallel processing. Following the bipartite join selection principle, two kinds of data are stored in the distributed file system (e.g., HDFS). One is the relational data, separated into projections which are stored in column-wise format. The other is the information assisting tuple selection, generated in the bipartite join selection construction and reduction phases. The data of a projection are horizontally partitioned into segments, and the size of each segment equals the block size of the distributed file system (e.g., HDFS) or less. It is important to keep each tuple complete within a projection segment (i.e., each tuple in the segment contains values on all attributes of the projection); otherwise much communication cost would be wasted on reconstructing tuples of the projection. The intermediate information is stored as files in the distributed file system, which can be accessed in the construction and reduction phases when necessary.

Following the bipartite join selection principle, its MapReduce implementation contains the two corresponding phases, with information gathering and information scattering as their respective main characteristics. In the bipartite join selection construction phase, the information about the selected data tuples of each projection is gathered together within one MapReduce execution, where the map function takes the projection identifier and the selected tuples as its input. Therefore, the selected tuple information of one projection is gathered together. The implementation of this phase achieves high parallelism, since the data of all projections participating in the chain join are distributed over the nodes and processed concurrently. As shown in Fig. 4, the data of all projections stored over the distributed nodes perform the information gathering simultaneously. For each MapReduce job, the first priority is to specify the <key, value> pair, which determines how the data are processed and gathered. In the bipartite join selection construction phase, the key is the projection identifier, while the value is a data structure containing three elements, i.e., {the meta data of the projection segment, a bitvector indicating the selected tuples, the reference attribute values in list format}. The meta data of the projection segment include the node address, the projection number, and the segment number.
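A minimal sketch of how the construction-phase map and reduce functions could look, written as plain Python generator functions rather than an actual Hadoop job; the record layout, the select() helper, and the field names are our own assumptions based on the <key, value> structure described above.

```python
# Sketch of the construction phase: gather selected-tuple information per projection.
# key = projection identifier; value = (segment meta data, bitvector, join-attribute values).
def construction_map(segment):
    """segment: dict with node address, projection/segment numbers, and column data."""
    selected = select(segment)                          # apply σ_i on this segment (assumed helper)
    bitvector = [1 if tid in selected else 0 for tid in segment["tuple_ids"]]
    values = [segment["join_column"][i] for i, bit in enumerate(bitvector) if bit]
    meta = (segment["node"], segment["projection_no"], segment["segment_no"])
    yield segment["projection_no"], (meta, bitvector, values)

def construction_reduce(projection_no, segment_infos):
    """Gather the per-segment information of one projection into one record (stored to HDFS)."""
    yield projection_no, {"projection": projection_no, "segments": list(segment_infos)}
```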
Fig. 2. Bipartite Join Selection Construction (selected tuple ids, bitvectors, and join-attribute value lists of P_1, P_2, and P_3 declared along the chain).

Fig. 3. Bipartite Join Selection Reduction (matched join-attribute values declared backwards along the chain: reduction on A_2, then reduction on A_1).
After this MapReduce computation, the information on the selected tuples of all projection segments over the nodes is gathered together by projection identifier and then stored as files in the distributed file system (e.g., HDFS).

The bipartite join selection reduction phase is executed in a pipeline fashion in the inverse order of the chain join, taking advantage of MapReduce for parallel processing. This phase selects, for each projection, the dedicated tuples that participate in the chain join. To accelerate the selection reduction and enhance the parallelism of its execution, it is processed along the chain. Take the processing in Fig. 4 as an example: the tuples of P_2 selected under function σ_2 that will participate in the chain join with P_3 are ultimately determined by the information notified from P_3. Therefore, it can be implemented in the MapReduce framework as follows: the selected data tuples of both P_2 and P_3 are mapped on the values of A_2, as shown in Fig. 4; the matched tuples are obtained at the Reduce side and the results are stored in HDFS as files. In this MapReduce job, the key of the <key, value> pair is the value of A_2, and the value is the same as that used in the MapReduce job described in the previous paragraph. The result of this job indicates the dedicated data tuples of P_2 that participate in the chain join with P_3. The second MapReduce job shown in Fig. 4 is analogous: it maps on A_1 and obtains the dedicated tuples of P_1 that participate in the chain join. Clearly, the main purpose of the bipartite join selection algorithm is to select the dedicated tuples of each projection that participate in the chain join, and consequently to optimize the communication. By employing the MapReduce framework for this bipartite join selection processing, we scale its capacity for massive data processing and achieve high parallelism.
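A sketch of one reduction-phase job, keyed on the value of the shared attribute A_2 as described above; again these are plain Python stand-ins for Hadoop map/reduce tasks, and the record fields are our own assumptions.

```python
# Sketch of one reduction step: a reduce-side match between P2 and P3 on A2.
def reduction_map(projection_no, record):
    """record: (tuple_id, a2_value, payload) of a selected tuple of P2 or P3."""
    tid, a2_value, payload = record
    yield a2_value, (projection_no, tid, payload)        # key = value of A2

def reduction_reduce(a2_value, tagged_records):
    """Emit P2 tuples only if some P3 tuple carries the same A2 value (a semi-join)."""
    records = list(tagged_records)
    if any(proj == "P3" for proj, _, _ in records):
        for proj, tid, payload in records:
            if proj == "P2":
                yield a2_value, (tid, payload)           # dedicated P2 tuples, stored to HDFS
```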
V. JOIN SEQUENCE REGULATION

As shown in Section IV, the data tuples of all projections that participate in the chain join are selected precisely, which has great potential for communication cost reduction. In this section, we explore the parallelism of chain join processing over those selected tuples in the MapReduce framework. By regulating the chain join sequence and reducing the volume of generated intermediate join results, higher processing parallelism and more cost-effective communication can be achieved. For comparison, we provide two chain join sequence regulation algorithms in this section, namely the linear join and the bushy join, which are introduced in detail in the following subsections.

A. Linear Join

The linear join, which joins all participating projections linearly, i.e., in a pipeline fashion, is the straightforward chain join processing based on the information constructed in the bipartite join selection process, and is thus a natural choice to accomplish the chain join in the given sequence. The linear chain join is processed in the MapReduce framework and is entirely compatible with the join algorithms proposed in Map-Reduce-Merge [18] (e.g., hash join, merge join), but it operates solely on the selected tuples; we omit the detailed algorithm here for brevity.
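As an illustration (not the paper's implementation), the linear join can be sketched as a left-deep pipeline of pairwise joins over the selected tuples; tuples are represented here as dicts keyed by attribute name.

```python
# Sketch of the linear (pipeline) chain join over already-selected tuples.
def join_pair(left, right, attr):
    """Join two tuple lists (dicts keyed by attribute name) on the shared attribute attr."""
    return [{**l, **r} for l in left for r in right if l[attr] == r[attr]]

def linear_chain_join(projections, join_attrs):
    """projections: [P1, P2, ..., Pn]; join_attrs: [A2, A3, ..., An] in chain order."""
    result = projections[0]
    for proj, attr in zip(projections[1:], join_attrs):
        result = join_pair(result, proj, attr)   # each step waits for the previous one
    return result
```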
Fig. 4. Bipartite Join Selection Implementation (left: selection construction, mapped on the projection number; right: selection reduction, mapped on A_2 and then on A_1).
The shortcomings of the linear join are as follows:
∙ The entire chain join is processed in a linear or pipeline fashion, resulting in a long execution time due to the long MapReduce startup time, especially for long chains.
∙ The parallelism of the join processing may not be fully exploited, because the joins at the rear of the chain wait for the intermediate results produced by the joins at the head of the chain.
∙ The intermediate results transmitted are large in volume, because they are enlarged at each join step of the chain. Take the intermediate chain join results shown in Fig. 3 as an example: the number of tuples that need to be transmitted to the following projection grows. Hence, breaking the linear chain join into a hierarchical join processing is promising for reducing the communication cost.
Intuitively, to process the chain join with higher effectiveness and efficiency, more parallelism should be exploited and the intermediate results should be reduced further. The bushy join algorithm proposed in Section V-B is a promising way to achieve both.
B. Bushy Join

To achieve higher parallelism in chain join processing and a higher reduction in intermediate results, we describe the bushy chain join in detail in this section. The bushy join processes the chain join in a tree manner: the bushier (lower) the tree is, the higher the parallelism and the reduction in intermediate results that can be achieved. Fig. 5 gives an intuitive example, where the chain join is processed as a balanced binary tree and achieves the highest parallelism; herein, the balanced binary tree is defined to be the bushiest tree. In this bushy join manner (e.g., a balanced binary tree), every two adjacent projections participating in the chain join are grouped together, and all the groups process the join operator concurrently. As shown in Fig. 5, projections P_1 and P_2, as well as P_3 and P_4, are grouped together, and their joins take place concurrently, with the intermediate results generated and stored in the distributed file system (e.g., HDFS). Recursively, joins are performed on the grouped intermediate results, and eventually the chain join result is generated.
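A minimal sketch (ours, under simplifying in-memory assumptions) of the bushy scheduling: adjacent operands are paired and joined level by level until one result remains. The join_pair helper is the same as in the linear join sketch, and the chain attributes are assumed to be A_1 ... A_{n+1} with P_i holding (A_i, A_{i+1}).

```python
# Sketch of the bushy (balanced binary tree) chain join over selected tuples.
def join_pair(left, right, attr):
    """Join two tuple lists (dicts keyed by attribute name) on the shared attribute."""
    return [{**l, **r} for l in left for r in right if l[attr] == r[attr]]

def bushy_chain_join(operands, join_attrs):
    """operands: [P1, ..., Pn]; join_attrs[i] is the attribute shared by operands i and i+1."""
    while len(operands) > 1:
        next_operands = []
        # Pair adjacent operands; each pair could be joined by an independent MapReduce job.
        for i in range(0, len(operands) - 1, 2):
            next_operands.append(join_pair(operands[i], operands[i + 1], join_attrs[i]))
        if len(operands) % 2 == 1:                       # odd trailing operand moves up a level
            next_operands.append(operands[-1])
        # The attribute linking two adjacent groups is the odd-indexed one between them.
        join_attrs = [join_attrs[i] for i in range(1, len(join_attrs), 2)]
        operands = next_operands
    return operands[0]
```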
Fig. 5. Bushy Join: the chain join of P1(A1,A2), P2(A2,A3), ..., P8(A8,A9) processed as a balanced binary tree.
Example 3: Take the bushy chain join shown in Fig. 5 as an example, which joins 8 projections (i.e., P_1, P_2, ..., P_8) together sequentially. For the bushy chain join, the 8 projections are grouped into 4 groups (i.e., {P_1, P_2}, {P_3, P_4}, ..., {P_7, P_8}), which join in parallel as the lowest level of the bushy tree. Thus, four intermediate join results are generated and stored in the distributed file system (i.e., HDFS), on which the joins of the second lowest level of the bushy tree take place. Recursively, the chain join is completed when all the intermediate results have been joined into one. Each parallel level of the bushy tree halves the number of intermediate result sets. The number of tuples that need to be transmitted in each join is related to the number of projections from which the intermediate join result is computed; intuitively, the lower the level of a join in the bushy tree, the fewer tuples need to be transmitted. As all the projections at the same level of the bushy tree are grouped and joined simultaneously, the join execution time is reduced dramatically compared to the linear join. Obviously, the bushy join provides much higher parallelism in chain join processing and a higher reduction in intermediate results.

VI. EXPERIMENTAL STUDY

In this section, we evaluate the performance of our proposed algorithm for chain join processing in the MapReduce framework for a column-wise data store. The experiments run on a cluster that contains 40 blades (8 Atom blades and 32 x86 blades, 288 CPU cores in total) with 40 terabytes of storage in total. The intra-rack network connection is 10000M/s. Each blade in the cluster runs Ubuntu version 9.10, and Hadoop [20] version 0.19 is deployed on the cluster on top of the Ubuntu operating systems. TPC-H [22] is the standard benchmark for decision support (data analysis over data in large volumes). In our experiments, we use its data generator to generate relational data of various sizes, from 1 GB to 50 GB. The originally generated data are in row-store format; we project the relational data into a set of projections, each of which contains 3-5 columns (attributes). Each projection is further horizontally partitioned into a set of segments, each of which contains 64 MB of data or less. Finally, the tuples within each segment of each projection are converted from row-wise to column-wise storage, and all segment data for all projections are loaded into HDFS in bulk for chain join processing.
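For concreteness, here is a small sketch (our own illustration, not the paper's loader) of the data preparation step described above: the rows of a projection are cut into segments of at most a configurable size, and each segment is then laid out column-wise, following the idea of the earlier to_column_store sketch.

```python
# Sketch of preparing a projection for HDFS: size-bounded segments, column-wise inside each.
import sys

def segment_projection(rows, attributes, max_bytes=64 * 1024 * 1024):
    """Cut rows into segments of roughly at most max_bytes, each stored column-wise."""
    segments, current, current_size = [], [], 0
    for row in rows:
        row_size = sum(sys.getsizeof(v) for v in row)    # rough per-row size estimate
        if current and current_size + row_size > max_bytes:
            segments.append(current)
            current, current_size = [], 0
        current.append(row)
        current_size += row_size
    if current:
        segments.append(current)
    # Column-wise layout within each segment, keeping every tuple complete in its segment.
    return [{a: [row[i] for row in seg] for i, a in enumerate(attributes)} for seg in segments]
```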
The performance evaluation of our proposed chain join algorithm mainly focuses on the communication cost and the execution parallelism.

A. Evaluation of Communication Cost

According to the bipartite join selection principle provided in Section IV, the communication cost for each projection to decide which dedicated tuples participate in the chain join varies with the length of the chain join and the relational data volume. As shown in Fig. 6, the communication cost increases with the length of the chain join and with the relational data volume, since more tuples participate, and decreases as the number of participating blades increases, since more nodes share the storage of the relational data.

B. A Comparison Study

The column store has an advantage in optimizing data reads when processing the chain join, i.e., less data is read, compared with the row-store methodology. Fig. 7(a) shows this advantage of the column store; the amount of data read is reduced even more as the length of the chain join increases. Ullman's chain join algorithm [7] joins several relations at once by mapping data tuples to multiple reduce nodes. The longer the chain join is, the more reduce nodes each tuple is sent to, which enlarges the communication cost dramatically. In contrast, the algorithm provided in this paper has the advantage of selecting only the dedicated tuples that participate in the chain join (Fig. 7(b)).

VII. CONCLUSION

In this paper, we present the design of the bipartite chain join, which selects the dedicated data tuples that participate in the chain join, resulting in dramatic communication savings. Detailed analysis, together with extensive experiments, proves that our bipartite chain join algorithm is efficient and valuable for processing chain joins in the MapReduce framework for a column-wise data store. In the future, we plan to provide a theoretical analysis of our proposed bipartite join algorithm.

Acknowledgement: This work is partially supported by the Shanghai International Cooperation Fund Project under grant No. 09530708400, the Alibaba Young Scholars Support Program Fund under grant No. Ali-2010-A-12, the National Science Foundation of China under grants No. 61003069, No. 60833003, and No. 60925008, and the National Hi-Tech 863 Program under grant No. 2009AA01Z149.
Fig. 6. Communication Cost for Bipartite Join Selection. (a) Communication cost for variable chain join lengths (4-7 projections); (b) communication cost for variable data volumes (20-50 GB). Both panels plot data transmitted per node (MB) against the number of blades (10-40).
Fig. 7. Performance Comparison. (a) Data read per node (MB), Column-Store vs. Row-Store; (b) data transmitted per node (MB), Bipartite Join vs. Ullman's Join. Both panels vary the number of projections from 4 to 7.
REFERENCES
[1] M. Armbrust, A. Fox, R. Griffith, A. Joseph, R. Katz, A. Konwinski, G. Lee, D. Patterson, A. Rabkin, I. Stoica, et al., "Above the clouds: A Berkeley view of cloud computing," EECS Department, University of California, Berkeley, Tech. Rep. UCB/EECS-2009-28, 2009.
[2] N. Carr, "The end of corporate computing," MIT Sloan Management Review, vol. 46, no. 3, pp. 67–73, 2005.
[3] J. Dean and S. Ghemawat, "MapReduce: Simplified data processing on large clusters," in OSDI, 2004, pp. 137–150.
[4] S. Ghemawat, H. Gobioff, and S. Leung, "The Google file system," in OSDI, 2003, pp. 29–43.
[5] R. L. V. and L. D., "Storage Adoption and Opportunities in Telecommunications, 2008," IDC Report, 2008.
[6] M. Stonebraker, D. Abadi, A. Batkin, X. Chen, M. Cherniack, M. Ferreira, E. Lau, A. Lin, S. Madden, E. O'Neil, et al., "C-store: a column-oriented DBMS," in VLDB, 2005, pp. 564–576.
[7] F. Afrati and J. Ullman, "Optimizing joins in a map-reduce environment," in EDBT, 2010.
[8] Yahoo!, "Hadoop Distributed File System Architecture," http://hadoop.apache.org/common/docs/current/hdfs_design.html, 2008.
[9] C. French, "'One size fits all' database architectures do not work for DSS," in SIGMOD, 1995, pp. 449–450.
[10] Sybase, "Sybase IQ," http://www.sybase.com/products/databaseservers/sybaseiq, 2009.
[11] Addamark, "Addamark," http://www.addamark.com/products/sls.html, 2009.
[12] Kx, "KDB+," http://kx.com/Products/kdb+.php, 2009.
[13] M. Stonebraker et al., "VERTICA," http://www.vertica.com/, 2009.
[14] D. Abadi, S. Madden, and N. Hachem, "Column-stores vs. row-stores: How different are they really?" in SIGMOD, 2008, pp. 967–980.
[15] Teradata Corp., "DBC/1012 Data Base Computer Concepts & Facilities," Document No. C02-0001-00, 1983.
[16] D. DeWitt, R. Katz, F. Olken, L. Shapiro, M. Stonebraker, and D. Wood, "Implementation techniques for main memory database systems," in SIGMOD, 1984, pp. 1–8.
[17] M. Kitsuregawa, H. Tanaka, and T. Moto-Oka, "Application of hash to data base machine and its architecture," New Generation Computing, vol. 1, no. 1, pp. 63–74, 1983.
[18] H. Yang, A. Dasdan, R. Hsiao, and D. Parker, "Map-reduce-merge: simplified relational data processing on large clusters," in SIGMOD, 2007, pp. 1040–1052.
[19] F. Afrati and J. Ullman, "A new computation model for cluster computing," in PODS, 2009.
[20] Yahoo!, "Hadoop," http://hadoop.apache.org, 2008.
[21] Yahoo!, "Hadoop MapReduce," http://hadoop.apache.org/common/docs/current/mapred_tutorial.html, 2008.
[22] TPC-H, "TPC-H," http://www.tpc.org/tpch/, 2009.