Generalized Query Processing Mechanism in Cloud Database Management System
Shweta Malhotra, Mohammad Najmud Doja, Bashir Alam, Mansaf Alam
Jamia Millia Islamia, New Delhi, India
[email protected],
[email protected],
[email protected],
[email protected]
Abstract. This is the epoch of Big Data, cloud computing, and cloud database management techniques. Traditional database approaches are not suitable for such colossal amounts of data. To overcome the limitations of RDBMS, MapReduce code can be considered a probable solution for processing such huge amounts of data, as MapReduce provides both scalability and reliability. To date, users work comfortably with traditional database systems such as SQL, MySQL, Oracle, and DB2, and they are not familiar with MapReduce code. In this paper we propose a model that can convert any RDBMS query into MapReduce code. We also apply an optimization technique that improves the performance of this amalgam approach.
Keywords: Cloud Database Management System, CDBMS, MapReduce, Hadoop, Optimized Algorithm, Cross Rack Communication, Data Processing.
1 Introduction
Cloud databases are based on distributed, parallel, and grid computing. Several companies, such as Yahoo, Microsoft, and Amazon, provide cloud database services. The cloud database service provided by a cloud provider is heterogeneous in nature. Heterogeneity appears in a number of forms, as shown in Fig 1:
• The types of users differ: simple users, small-scale organizations, and medium-scale organizations.
• The types of data processed on the cloud differ: structured, unstructured, or semi-structured data.
• The types of database service differ: RDBMS-style SQL-based services or NoSQL.
An IDC report [1] predicts that by 2020 the global data volume will grow to 40 zettabytes, with data doubling every two years. Cloud Database Management Systems and Big Data are therefore integral terms. Cloud providers cannot handle such heterogeneous data with traditional database management systems: an RDBMS is not suitable because it has limited capacity and, moreover, cannot explore the original high-fidelity data due to the different layers used for storage and processing. Companies are now coming up with many solutions for such cloud database and Big Data problems; one of the simplest among them is MapReduce.
Fig 1. Heterogeneous Cloud Database Management System

MapReduce code is well suited to such situations for the following reasons:
• MapReduce provides scalability.
• MapReduce code is written in the form of simple key-value pairs.
• Large and complex data processing is done with the help of just two simple functions, Map and Reduce.
• MapReduce code can be used to store data in an encrypted form.
In this paper we propose one generalized model as a solution to such cloud database and Big Data processing problems. The system takes queries in any database language, and the model converts the queries into simple MapReduce code. The actual processing is done by the MapReduce code, as shown in Fig 2.

Fig 2. System Architecture
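The key-value style of processing listed above can be illustrated with a minimal word-count sketch in pure Python (no Hadoop involved; the function names are ours, chosen for illustration): the map function emits (word, 1) pairs and the reduce function sums the values per key, with a sort standing in for the shuffle phase.

```python
# Minimal sketch of the MapReduce key-value model (illustrative, pure Python):
# map emits (key, value) pairs, a sort simulates the shuffle phase, and
# reduce aggregates the values that share a key.
from itertools import groupby
from operator import itemgetter

def map_fn(line):
    # Emit one (word, 1) pair per word in the input line.
    for word in line.split():
        yield (word, 1)

def reduce_fn(key, values):
    # Sum all counts emitted for one key.
    return (key, sum(values))

def run_mapreduce(lines):
    pairs = [kv for line in lines for kv in map_fn(line)]
    pairs.sort(key=itemgetter(0))                    # shuffle/sort phase
    return dict(reduce_fn(k, (v for _, v in grp))
                for k, grp in groupby(pairs, key=itemgetter(0)))

print(run_mapreduce(["big data cloud", "cloud data"]))
# {'big': 1, 'cloud': 2, 'data': 2}
```

On a real Hadoop cluster the same two functions run in parallel across many nodes, which is where the scalability noted above comes from.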
The proposed model consists of the following segments:
• First, the user interface, which takes up the user's database queries.
• Second, the compiler, which converts code written in database languages into MapReduce code.
• Last, an optimization technique proposed for the MapReduce code.
The rest of the paper is organized as follows. The current state of the work is described in Section 2. The layer-wise system model, architecture, and optimization technique are explained in Sections 3 and 4. Finally, Section 5 concludes the paper.
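The compiler segment can be sketched as follows. This is a hypothetical toy, not the paper's actual compiler: it handles only one query shape, `SELECT <col> FROM <table> WHERE <col> = '<value>'`, and turns the WHERE predicate into a map-side filter.

```python
# Hypothetical sketch of the compiler stage: translate one restricted SQL
# shape into a (map, reduce) pair. The tiny grammar and all names here are
# illustrative assumptions, not the paper's implementation.
import re

def compile_query(sql):
    m = re.match(r"SELECT (\w+) FROM (\w+) WHERE (\w+) = '(\w+)'", sql)
    if not m:
        raise ValueError("unsupported query shape")
    proj, _table, col, val = m.groups()

    def map_fn(row):                  # row: one record as a dict
        if row.get(col) == val:       # WHERE predicate becomes a map-side filter
            yield (row[proj], 1)

    def reduce_fn(key, values):       # counts duplicates of each projected value
        return (key, sum(values))

    return map_fn, reduce_fn

map_fn, reduce_fn = compile_query("SELECT name FROM users WHERE city = 'Delhi'")
rows = [{"name": "a", "city": "Delhi"}, {"name": "b", "city": "Agra"}]
print([kv for r in rows for kv in map_fn(r)])   # [('a', 1)]
```

A full compiler would cover joins, grouping, and the other source languages the model accepts, but the translation principle (predicates into maps, aggregation into reduces) is the same.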
2 Related Work
2.1 MapReduce Functions
The MapReduce programming paradigm is based on parallel and distributed computing, in which the Map and Reduce functions operate on key-value pairs [2] [7]. Parallelism and pipelining are used to enhance MapReduce functionality. The authors in [9] describe how MapReduce consists of several intermediate phases, such as shuffle, sort, and merge; these phases play an important role in effective resource allocation. MapReduce code is used for data analysis on similar kinds of data sets, and hence the authors in [11] propose a CloudView framework for processing, analyzing, and maintaining massive data sets; this framework uses Map-Join-Reduce phases. MapReduce tasks [12] raise many problems, one of which is scheduling the tasks effectively; the author in [12] emphasizes two important factors, energy consumption and simulation time. The authors in [4] describe SQLMR, a system that converts SQL queries into MapReduce code, and compare its performance with Hive and HadoopDB; this paper is based on the same notion but is applicable to all databases. Several databases are available for cloud use, but with limited query capabilities, and one issue arises when data needs to be integrated across local as well as cloud databases; hence the authors in [8] propose BigIntegrator, a system used to process data present in local databases together with data located in cloud repositories.

2.2 Optimization Technique for MapReduce
The authors in [5] describe an optimal algorithm to minimize cross-rack communication. In their optimization they discuss the reducer placement problem and define a function for the data coming into and going out of a rack in terms of the reducers. The authors in [6] describe an improved algorithm for data load balancing in Hadoop; they implement an algorithm that preferentially rebalances the data on overloaded racks.
3 System Design
Cloud Database Management Systems and Big Data are related to each other, as the data present in cloud databases is also recognized as Big Data. Big Data can be characterized in three ways: a large amount of data, on the order of quintillions of bytes; data in structured or unstructured form; and data that cannot be processed with the help of traditional RDBMS. Traditional RDBMS are not suitable for cloud databases, as they cannot process large amounts of such complex data. To date, users are familiar with traditional databases, but MapReduce, a simple programming model that runs on the Hadoop [3] framework, is considered the suitable choice for such huge amounts of data for the following reasons:
• MapReduce code provides parallel processing of large and complex data.
• MapReduce provides scalability.
The generalized model proposed in this paper fills this gap.
3.1 System Architecture
There is a need for a system that can process the massive data present in cloud database repositories without much difficulty on the user's side. Fig 3 shows the complete structure and the step-by-step working of such a system, called the Generalized System.

Fig 3. System Architecture

First, users send queries through the user interface, and the compiler converts the queries into MapReduce code. Then an optimized algorithm for the placement of the Map and Reduce functions is applied in such a way as to minimize cross-rack communication, giving the best results in terms of running time.

3.2 Layer-wise System Model
The system architecture explained in the previous section can be described in terms of cloud database layers [10]; Fig 4 shows the layer-wise system model. A Cloud Database Management System consists of five layers [10]: the External Layer, which provides the user interface; the Conceptual Middleware Layer and the Physical Middleware Layer, which provide virtualization; the Conceptual Layer, which handles the processing of data; and the Physical Layer, which handles the effective storage of the data present in cloud repositories. The working of the proposed generalized model is defined in terms of these layers as follows. The External Layer provides the user interface, where the user writes queries in any database language, such as SQL, Oracle, or DB2. At the Conceptual Middleware Layer, queries written in different languages are converted into MapReduce code. The actual processing is done at the Conceptual Layer. The MapReduce code can be further improved at the Physical Middleware and Physical Layers with the help of algorithms such as cross-rack communication optimization for the mappers and reducers.
Fig 4. Layer wise working of Model
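The layer-wise flow of a query through the model can be sketched as a chain of functions, one per layer. This is an illustrative skeleton only, with stubbed bodies and invented names; it shows the order in which the layers touch a query, not the paper's implementation.

```python
# Illustrative sketch (not the paper's implementation): each of the five
# layers is modelled as a function, and a query context flows through them.
def external_layer(query):
    # User interface: accept the raw query text.
    return {"query": query}

def conceptual_middleware(ctx):
    # Translate the query into a MapReduce job (stubbed).
    ctx["mapreduce_job"] = f"MR({ctx['query']})"
    return ctx

def physical_middleware(ctx):
    # Hook for placement optimization of mappers/reducers (stubbed).
    return ctx

def physical_layer(ctx):
    # Storage access on the cloud repositories (stubbed).
    return ctx

def conceptual_layer(ctx):
    # Execute the MapReduce job and produce a result (stubbed).
    ctx["result"] = f"result of {ctx['mapreduce_job']}"
    return ctx

def process(query):
    ctx = external_layer(query)
    for layer in (conceptual_middleware, physical_middleware,
                  physical_layer, conceptual_layer):
        ctx = layer(ctx)
    return ctx["result"]

print(process("SELECT * FROM t"))   # result of MR(SELECT * FROM t)
```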
4 Optimization Technique
The authors in [5] describe an optimal algorithm for cross-rack optimization, in which they discuss the reducer placement problem and define, for each rack i, a function for the data coming into and going out of the rack in terms of the reducers, i.e.

S_i = d_i^in + d_i^out    (1)

where d_i^in is the amount of data coming into rack i and d_i^out is the amount of data sent out of rack i as a result of the reducer placement.
They assume that the mapper function is placed on the same rack as the data, to preserve data locality, and then find the most effective position for the reducer function on the racks so as to minimize cross-rack communication.
Fig 5. Cross Rack Communication

In this paper we use the same optimal algorithm for cross-rack optimization, as shown in Fig 5, but with the difference that it considers the placement of both the mapper function and the reducer function on the DataNodes of the racks; earlier, the DataNodes were not considered when placing the reducer function on the racks. The following assumptions are made in this paper:
• N is the number of racks,
• D is the number of DataNodes present on each rack,
• M is the set of mapper functions,
• R is the set of reducer functions.
For example, if the data is located at DataNode1 of Rack1, then with 90% probability the mapper is placed on the same DataNode of the same rack as the data, with 7% probability the mapper is placed on another DataNode of the same rack, and with only 3% probability the mapper is placed on a DataNode of another rack. The same applies to the reducer. Now, for each rack, the amount of data S_i sent from or coming into the rack is defined based on equation (1), but with the function split into cases in terms of the mapper and reducer placement:
• S_i is (approximately) zero if the mapper and reducer are placed on the same DataNode of the same rack as the data;
• S_i equals the intra-rack transfer if the mapper is placed on the same DataNode as the data but the reducer is placed on another DataNode of the same rack;
• S_i equals the cross-rack transfer if the mapper is placed on the same DataNode of the same rack as the data but the reducer is placed on a different DataNode of a different rack;
where i = 1, 2, 3, …, p ranges over the racks. Placing the MapReduce functions according to these cases can improve the performance of the proposed model, as they take the placement of both the mapper and the reducer functions into account.
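The effect of these placement cases can be illustrated with a toy expected-cost calculation using the probabilities assumed above (90% same DataNode, 7% another DataNode on the same rack, 3% another rack). The per-case transfer costs `c_node`, `c_rack`, and `c_cross` are illustrative parameters of our own, not values from the paper; all that matters for the argument is the ordering c_node < c_rack < c_cross.

```python
# Toy sketch of the expected data movement per placement, using the
# placement probabilities assumed in the text. The per-case costs are
# illustrative assumptions (same-node transfer is free, intra-rack is
# cheap, cross-rack is the expensive case the algorithm minimizes).
P_SAME_NODE, P_SAME_RACK, P_OTHER_RACK = 0.90, 0.07, 0.03

def expected_traffic(data_mb, c_node=0.0, c_rack=0.1, c_cross=1.0):
    """Expected amount of data_mb moved across node/rack boundaries."""
    return data_mb * (P_SAME_NODE * c_node +
                      P_SAME_RACK * c_rack +
                      P_OTHER_RACK * c_cross)

print(expected_traffic(1000))   # 37.0 MB expected movement per 1000 MB
```

Raising the probability of same-node placement (or lowering the cross-rack probability) directly lowers this expectation, which is why considering DataNodes as well as racks when placing both mappers and reducers helps.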
5 Conclusion
We have proposed a model that can effectively process data in cloud repositories, overcoming the limitations of existing traditional RDBMS systems. The proposed model consists of three modules: first, the user interface, where users can put their queries in any database language; second, the compiler, which converts the queries into MapReduce code; and last, the enhanced optimal cross-rack algorithm, which minimizes cross-rack communication by considering the placement of both the mapper and the reducer. In the future, the proposed model can be implemented with the help of an existing programming language.
References
1. J. Gantz and D. Reinsel, "The digital universe in 2020: Big data, bigger digital shadows, and biggest growth in the Far East," in Proc. IDC iView, IDC Anal. Future, 2012.
2. J. Dean and S. Ghemawat, "MapReduce: Simplified data processing on large clusters," Commun. ACM, 51(1), pp. 107-113, 2008.
3. Apache Hadoop, http://hadoop.apache.org.
4. M. Hsieh, C. Chang, L. Ho, J. Wu and P. Lui, "SQLMR: A Scalable Database Management System for Cloud Computing," International Conference on Parallel Processing (ICPP) 2011, pp. 315-324.
5. H. Li-Yung, W. Jan-Jan, L. Pangfeng, "Optimal Algorithm for Cross-Rack Communication Optimization in MapReduce Framework," IEEE International Conference on Cloud Computing 2011, pp. 420-427.
6. K. Liu, G. Xu, J. Yuan, "An Improved Hadoop Data Load Balancing Algorithm," Journal of Networks, Dec 2013, Vol 8, Issue 12, pp. 2816-2822.
7. D. Dahiphale, R. Karve, A. V. Vasilakos, H. Liu, Z. Yu, "An Advanced MapReduce: Cloud MapReduce, Enhancements and Applications," IEEE Transactions on Network and Service Management 2014, Vol 11, Issue 1, pp. 101-115.
8. M. Zhu and T. Risch, "Querying Combined Cloud-Based and Relational Databases," International Conference on Cloud and Service Computing (CSC) 2011, pp. 330-335.
9. Q. Zhang, M. F. Zhani, Y. Yang, B. Wong, "PRISM: Fine-Grained Resource-Aware Scheduling for MapReduce," IEEE Transactions on Cloud Computing 2015, Vol 3, Issue 2, pp. 182-194.
10. S. Mongia, M. N. Doja, B. Alam, M. Alam, "5 Layered Architecture of Cloud Database Management System," AASRI Conference on Parallel and Distributed Computing and Systems, Vol 5, pp. 194-199, 2013.
11. R. Bhardwaj, N. Mishra, R. Kumar, "Data Analyzing Using Map-Join-Reduce in Cloud Storage," IEEE 2014 International Conference on Parallel, Distributed and Grid Computing, pp. 370-373.
12. Q. Althebyan, Q. Qudah, Y. Jararweh, Q. Yaseen, "Multi-threading Based MapReduce Tasks Scheduling," 2014 IEEE International Conference on Information and Communication Systems (ICICS), pp. 1-6.
13. M. Alam, K. A. Shakil, S. Sethi, "Analysis and Clustering of Workload in Google Cluster Trace Based on Resource Usage," arXiv preprint arXiv:1501.01426, 2015.
14. M. Alam, K. A. Shakil, "A Decision Matrix and Monitoring Based Framework for Infrastructure Performance Enhancement in Cloud Based Environment," Advances in Engineering and Technology Series, Vol 7, pp. 147-153, Elsevier, 2013.