Performance Comparison Between Apache Hive and Oracle SQL for Big Data Analytics
Rotsnarani Sethy (✉), Santosh Kumar Dash, and Mrutyunjaya Panda
Department of Computer Science, Utkal University, Bhubaneswar, India
[email protected]
Abstract. Big data refers to a massive volume of data that cannot be stored, processed, and managed by traditional database management systems. Big Data Analytics has become a broad research area that attracts both academia and industry because it extracts knowledge and information from large amounts of data. Oracle SQL is a prominent DBMS that is used worldwide, but its running time increases as the data grows. Apache Hive makes large-scale data analysis possible in less time: it facilitates reading, writing, and managing big datasets in a distributed environment using SQL, whereas Oracle SQL provides an integrated development environment for running queries and scripts. In this paper, we run a few queries on smaller as well as larger datasets and analyze the results in both the Apache Hive and Oracle SQL environments.
Keywords: Big data · Apache Hadoop · Apache Hive · Oracle SQL · Query processing
1 Introduction
In the last 10 years, the field of Information Technology has grown substantially. Advances in developing automated solutions to real-world problems have made Big Data Analytics a recent research area. Big Data refers to a huge volume of structured, unstructured, or semi-structured data that has the potential to be mined for information but cannot be stored, processed, and managed efficiently by traditional database methods and tools within a given time frame. This enormous increase in data creates the need to explore proper tools, techniques, and frameworks for Big Data Analytics that cope with high volume, velocity, and variety [1, 2]. Hadoop and Apache Hive are recently developed tools intended to handle these vast amounts of data (Big Data) and the related issues [3]. Apache Hive is an open source application for analyzing massive datasets in a high-level language [4]. It is data warehouse software that facilitates querying and managing big datasets in distributed storage, and it runs on top of Hadoop.
© Springer International Publishing AG 2018 A. Abraham et al. (eds.), Proceedings of the Eighth International Conference on Soft Computing and Pattern Recognition (SoCPaR 2016), Advances in Intelligent Systems and Computing 614, DOI 10.1007/978-3-319-60618-7_14
Hadoop is a well-known open source implementation of MapReduce, written in Java, that allows distributed processing of large datasets across clusters of computers using simple programming models, and it is used by both academia and industry. Hadoop addresses different issues and challenges of Big Data [5]. It is designed to scale up from a single server to thousands of machines, each offering local computation and storage [6]. Oracle SQL is a declarative language that provides an interface to a relational database management system such as Oracle Database [7]. SQL is non-procedural and is an ANSI standard language; we use it to perform operations on the database through SQL statements. In this paper, we use it to process some queries and then compare the performance of Apache Hive and Oracle SQL. The rest of the paper is organized as follows: Sect. 2 discusses the related work available in the literature. Section 3 presents the proposed methodology followed by the dataset description. Experimental results are discussed in Sect. 4. Finally, Sect. 5 concludes the paper and outlines the future scope of research.
2 Related Work
Definitions of Big Data, its characteristics, and different tools and frameworks are introduced in [1]. The author of [2] proposed a Multiple Query Optimization (MQO) based framework, distributed Hive, to improve the performance of conventional Hadoop Hive, and gave brief information about its architecture, which is a modified version of Hadoop Hive with a new MQO component. In [3], a column statistics approach was developed to improve the performance of Hive Query Language (Hive QL) queries executed in the Hive framework. The authors of [4] presented a Hive profiling tool based on log analysis, which can be used to test the impact of new software features and can efficiently replace hand-drawn tables and charts. They demonstrated that their tool helps developers compare Hive queries written in different formats, configured with different parameters, and run on different datasets. The authors of [5] pointed out some of the major issues in big data storage, management, and processing, as well as some of the major challenges of Big Data going forward. The authors of [6] described Big Data and Hadoop in detail, introduced the general background of Big Data and the Hadoop platform using the MapReduce algorithm, and focused on the Hadoop components that support the processing of large datasets in distributed computing environments. The authors of [7] presented Hive, an open source data warehousing solution built on top of Hadoop, which supports queries expressed in Hive QL, an SQL-like declarative language, that are executed on Hadoop.
• Motivation
It is observed from the literature that many researchers have tried to perform Big Data Analytics using traditional methods, which resulted in poor performance due to memory constraints.
Hence, we are motivated to explore the suitability of Apache Hive as a distributed database for faster retrieval in comparison to the popular Oracle SQL approach.
• Objective
In this paper, our objectives are:
– Performance analysis of Oracle SQL w.r.t. time.
– Performance analysis of Apache Hive w.r.t. time.
– Comparison of the mean processing time of Apache Hive and Oracle SQL.
• Apache Hive Model for Query Processing
Figure 1 shows the nine steps for processing a query in Apache Hive on Hadoop. The stages involve the main components: UI, Driver, Compiler, Metastore, and Execution Engine. In step 1, the user interface calls the execute-query interface of the driver. In step 2, the driver creates a session handle for the query and sends it to the compiler, which generates an execution plan for the execution engine. In steps 3 and 4, the compiler gets the required metadata from the Metastore. In step 5, the compiler sends the generated plan to the driver. The components of Apache Hive are linked to HDFS in steps 6.1, 6.2, and 6.3. In steps 7, 8, and 9, the mapper and reducer operations are performed [8, 9].
Fig. 1. Apache hive architecture [9]
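The mapper and reducer operations mentioned in the query processing steps can be illustrated with a small pure-Python sketch of how a COUNT with GROUP BY query decomposes into map, shuffle, and reduce phases. This is only an illustration of the MapReduce idea, not Hive's actual execution engine; the function names and the sample rows are hypothetical, and the column name "codec" is borrowed from the video dataset purely as an example.

```python
from collections import defaultdict


def map_phase(rows, key_column):
    """Mapper: emit a (key, 1) pair for every input row."""
    for row in rows:
        yield row[key_column], 1


def shuffle_phase(pairs):
    """Shuffle: collect all emitted values under their key."""
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped


def reduce_phase(grouped):
    """Reducer: sum the values of each key to obtain its count."""
    return {key: sum(values) for key, values in grouped.items()}


def count_group_by(rows, key_column):
    """Equivalent in spirit to: SELECT key, COUNT(*) ... GROUP BY key."""
    return reduce_phase(shuffle_phase(map_phase(rows, key_column)))


# Hypothetical sample rows (made up for illustration only).
sample_rows = [
    {"codec": "h264"},
    {"codec": "vp8"},
    {"codec": "h264"},
    {"codec": "mpeg4"},
]
counts = count_group_by(sample_rows, "codec")
print(counts)
```

On a real cluster the map and reduce phases run in parallel on different machines, which is where the benefit for large datasets is expected to come from.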
• Oracle SQL Model for Query Processing
Figure 2 shows all stages of Oracle SQL query processing, that is, how Oracle Database processes an SQL statement: how DDL statements create objects, how DML statements modify data, and how queries retrieve data from an Oracle database. Query processing has four stages: parsing, optimization, row source generation, and execution of the SQL statement. Depending on the statement, some of these steps may change [10].
Fig. 2. Stages of oracle SQL query processing
3 Proposed Methodology and Data Set Description
Our proposed model, shown in Fig. 3, uses four parameters for determining the results: the first is the number of instances, the second is the query statements, the third is the file size of the dataset and data retrieval, and the final one is the performance comparison between Apache Hive and Oracle SQL and the output analysis. Four datasets are used, collected from the UCI Machine Learning Repository and Kaggle.
First dataset: The Online Video Characteristics and Transcoding Time Dataset is used with 168286 instances. The video dataset contains 10 columns. The columns are YouTube video id, duration, bitrate (total in Kbits), bitrate (video bitrate in Kbits), height (in pixels), width (in pixels), frame rate, estimated frame rate, codec, category, and direct video link. The video dataset can be used for determining characteristics of consumer videos found on UGC sites (YouTube) [11].
Second dataset: The Record Linkage Comparison Patterns Data Set is used with 5749132 instances. The record dataset contains 12 columns. The columns are id_1, id_2, cmp_fname_c1, cmp_fname_c2, cmp_lname_c1, cmp_lname_c2, cmp_sex, cmp_bd, cmp_bm, cmp_by, cmp_plz, and is_match. This dataset is an element-wise comparison of records with personal data from a record linkage setting. It can be used to decide from a comparison pattern whether the underlying records belong to one person [12].
Third dataset: The 3D Road Network (North Jutland, Denmark) Dataset is used with 434874 instances. The road dataset has 4 columns. The columns are OSM_ID, LONGITUDE, LATITUDE, and ALTITUDE. The road dataset can be used in eco-routing and fuel/CO2 estimation routing algorithms [13].
Fourth dataset: The Rate (Health Insurance Marketplace) Dataset is used with 13,000,000 instances. The rate dataset consists of 23 columns. The columns are BusinessYear, StateCode, IssuerId, Source_Name, Version_Num, ImportDate, Issuer_Id2, FederalTIN, RateEffectiveDate, RateExpirationDate, PlanId, RatingAreaId, Tobacco, Age, Individual Rate, Individual Tobacco Rate, Couple, Primary Subscriber And One Dependent, Primary Subscriber And Two Dependents, Primary Subscriber And Three Or More Dependents, Couple And One Dependent, Couple And Two Dependents, Couple And Three Or More Dependents, and RowNumber. The rate dataset belongs to the Health Insurance Marketplace Public Use Files, which contain data on health and dental plans offered to individuals and small businesses through the US Health Insurance Marketplace [14].
Fig. 3. Proposed model.
We have taken a smaller dataset (Online Video) as well as bigger datasets (3D Road Network, Record Linkage, and Rate) to understand the suitability of Oracle SQL and Apache Hive for data retrieval, which may serve as an ingredient for future Big Data Analytics.
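The instance and column counts quoted in this section can be checked before loading a file into either system. The following is a minimal sketch using only the Python standard library; it assumes a comma-separated file with a header row, which may not hold for every dataset listed above, and the function name and the in-memory sample (with made-up values under the road dataset column names) are hypothetical.

```python
import csv
import io


def count_instances_and_columns(text_stream, delimiter=","):
    """Return (number of data rows, number of columns) of a delimited file.

    Assumes the first row is a header; this is an assumption of the sketch.
    """
    reader = csv.reader(text_stream, delimiter=delimiter)
    header = next(reader)
    instances = sum(1 for _ in reader)
    return instances, len(header)


# In-memory stand-in for a dataset file; the values are made up.
sample = io.StringIO(
    "OSM_ID,LONGITUDE,LATITUDE,ALTITUDE\n"
    "1,9.35,56.74,17.05\n"
    "2,9.35,56.74,17.61\n"
)
instances, columns = count_instances_and_columns(sample)
print(instances, columns)
```

For a file on disk, the same function can be called on a handle opened with `open(path, newline="")`, where `path` is whatever location the dataset was downloaded to.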
4 Experimental Results
All experiments were carried out in a virtual machine running CentOS 7 with 8 GB of RAM and a 2-core processor. The results obtained are summarized in Tables 1, 2, 3, 4, 5, 6, 7, 8, and 9. Figures 4, 5, 6, and 7 show column charts of the performance comparison between Apache Hive and Oracle SQL for the video, record, 3D road network, and rate datasets, respectively.
Table 1. Query statements for performance analysis of Apache Hive and Oracle SQL
Query    Description                                                                 Statements
Query 1  Retrieving Unique column using DISTINCT                                     Retrieving unique Output
Query 2  Retrieving Records from a given dataset using ORDER BY for general Sorting  Records Sorting
Query 3  Retrieving Records Using ORDER BY and DESC for Backward Sorting             Sorting Backward
Query 4  Using COUNT and GROUP BY for Retrieving records and their count.            Grouping with Counting
Query 5  Using MAX aggregate function for retrieving MAXIMUM value from a record     Maximum Value
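To make the five query types of Table 1 concrete, the sketch below runs one query of each kind against a tiny in-memory table. It uses Python's built-in sqlite3 module as a stand-in, not Apache Hive or Oracle SQL, and the table name, column names, and rows are hypothetical; the exact statements used in the experiments are not given in the paper.

```python
import sqlite3

# In-memory SQLite database used only as a stand-in for Hive / Oracle SQL.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE video (id TEXT, codec TEXT, duration INTEGER)")
conn.executemany(
    "INSERT INTO video VALUES (?, ?, ?)",
    [("a", "h264", 120), ("b", "vp8", 45), ("c", "h264", 300), ("d", "mpeg4", 60)],
)

# One illustrative statement per query type listed in Table 1.
queries = {
    "Query 1": "SELECT DISTINCT codec FROM video",                     # unique values
    "Query 2": "SELECT id, duration FROM video ORDER BY duration",      # general sorting
    "Query 3": "SELECT id, duration FROM video ORDER BY duration DESC", # backward sorting
    "Query 4": "SELECT codec, COUNT(*) FROM video GROUP BY codec",      # grouping with counting
    "Query 5": "SELECT MAX(duration) FROM video",                       # maximum value
}

results = {name: conn.execute(sql).fetchall() for name, sql in queries.items()}
for name, rows in results.items():
    print(name, rows)
conn.close()
```

The same five statement shapes are valid in both Hive QL and Oracle SQL, which is what makes a like-for-like timing comparison possible.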
Table 2. Hive query result table for video dataset (time in seconds)
Query (168 k)  Query 1  Query 2  Query 3  Query 4  Query 5
Attempt 1      100      74       63       70       69
Attempt 2      62       69       63       55       51
Attempt 3      60       66       61       54       50
Mean           74       69.66    62.33    59.66    56.66
Table 3. Oracle SQL query result table for video dataset (time in seconds)
Query (168 k)  Query 1  Query 2  Query 3  Query 4  Query 5
Attempt 1      4.63     9.042    11.15    1.46     1.47
Attempt 2      4.62     9.041    11.00    1.93     1.93
Attempt 3      3.61     9.041    10.89    0.92     0.92
Mean           4.28     9.041    11.01    1.43     1.44
Table 4. Hive query result table for record dataset (time in seconds)
Query (574 k)  Query 1  Query 2  Query 3  Query 4  Query 5
Attempt 1      45       37       43       32       32
Attempt 2      19       19       20       21       15
Attempt 3      19       14       15       18       11
Mean           27.66    23.33    26       23.66    19.33
Table 5. Oracle SQL query result table for record dataset (time in seconds)
Query (574 k)  Query 1  Query 2  Query 3  Query 4  Query 5
Attempt 1      76.65    62.45    78.78    5.00     2.84
Attempt 2      45.46    60.78    55.58    4.95     2.47
Attempt 3      41.70    48.71    49.17    4.70     1.43
Mean           54.60    57.31    61.17    4.888    2.24
Table 6. Hive query result table for road dataset (time in seconds)
Query (438 k)  Query 1  Query 2  Query 3  Query 4  Query 5
Attempt 1      24       25       29       21       22
Attempt 2      11       15       15       10       12
Attempt 3      11       13       14       10       10
Mean           15.33    17.66    19.33    13.66    14.66
Table 7. Oracle SQL query result table for road dataset (time in seconds)
Query (438 k)  Query 1  Query 2  Query 3  Query 4  Query 5
Attempt 1      27.81    31.76    27.45    30.16    0.031
Attempt 2      28.56    26.15    25.23    25.45    0.016
Attempt 3      23.11    20.45    21.85    22.70    0.015
Mean           26.46    26.12    24.84    26.10    0.020
Table 8. Hive query result table for rate dataset (time in seconds)
Query (13 M)   Query 1  Query 2  Query 3  Query 4  Query 5
Attempt 1      945      814      920      709      699
Attempt 2      418      413      403      454      330
Attempt 3      388      308      327      389      246
Mean           583.66   511.66   550      517.33   425
Table 9. Oracle SQL query result table for rate dataset (time in seconds)
Query (13 M)   Query 1  Query 2  Query 3  Query 4  Query 5
Attempt 1      1694.73  1380.14  1741.82  110.55   10.70
Attempt 2      1005.21  1343.84  1228.87  109.44   10.47
Attempt 3      921.98   1074.97  1087.14  103.91   10.45
Mean           1207.06  1266.31  1352.61  107.96   10.54
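Each result table reports three attempts per query and their mean. A measurement of this shape could be collected with a small harness like the one below. This is a sketch under stated assumptions, not the procedure actually used in the experiments: the function name `time_attempts` is hypothetical, and `placeholder_query` stands in for whatever call submits a query to Hive or Oracle SQL.

```python
import statistics
import time


def time_attempts(run_query, attempts=3):
    """Run `run_query` several times; return the timings (seconds) and their mean."""
    timings = []
    for _ in range(attempts):
        start = time.perf_counter()
        run_query()
        timings.append(time.perf_counter() - start)
    return timings, statistics.mean(timings)


def placeholder_query():
    """Hypothetical stand-in for submitting one query to a database."""
    sum(range(100_000))


timings, mean_seconds = time_attempts(placeholder_query)
print([round(t, 4) for t in timings], round(mean_seconds, 4))
```

Keeping the individual attempts next to the mean, as the tables do, makes run-to-run variation visible.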
Fig. 4. Performance comparison between Apache Hive and Oracle SQL for the video dataset. The X-axis shows the queries and the Y-axis shows the time in seconds.
Fig. 5. Performance comparison between Apache Hive and Oracle SQL for the record dataset. The X-axis shows the queries and the Y-axis shows the time in seconds.
Fig. 6. Performance comparison between Apache Hive and Oracle SQL for the road dataset. The X-axis shows the queries and the Y-axis shows the time in seconds.
Fig. 7. Performance comparison between Apache Hive and Oracle SQL for the rate dataset. The X-axis shows the queries and the Y-axis shows the time in seconds.
The line graph below (Fig. 8) shows that the processing time differs across the datasets according to their number of tuples and volume. For the video dataset, Oracle SQL takes less time than Apache Hive. For all the remaining datasets (road, record, and rate), Apache Hive takes less time than Oracle SQL.
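As a rough cross-check of this comparison, the Mean rows of Tables 2 to 9 can be averaged per dataset and system. The sketch below does this in Python; averaging the five per-query means into one number is an assumption made here for illustration and is not necessarily how the line graph was produced.

```python
from statistics import mean

# Mean rows (seconds) copied from Tables 2 to 9, Query 1 to Query 5.
mean_times = {
    "video": {
        "Apache Hive": [74, 69.66, 62.33, 59.66, 56.66],
        "Oracle SQL": [4.28, 9.041, 11.01, 1.43, 1.44],
    },
    "record": {
        "Apache Hive": [27.66, 23.33, 26, 23.66, 19.33],
        "Oracle SQL": [54.60, 57.31, 61.17, 4.888, 2.24],
    },
    "road": {
        "Apache Hive": [15.33, 17.66, 19.33, 13.66, 14.66],
        "Oracle SQL": [26.46, 26.12, 24.84, 26.10, 0.020],
    },
    "rate": {
        "Apache Hive": [583.66, 511.66, 550, 517.33, 425],
        "Oracle SQL": [1207.06, 1266.31, 1352.61, 107.96, 10.54],
    },
}

# Average of the five per-query means for each dataset and system.
overall = {
    dataset: {system: mean(values) for system, values in systems.items()}
    for dataset, systems in mean_times.items()
}

# The system with the lower overall average for each dataset.
faster = {dataset: min(times, key=times.get) for dataset, times in overall.items()}

for dataset, times in overall.items():
    summary = ", ".join(f"{system}: {seconds:.2f} s" for system, seconds in times.items())
    print(f"{dataset}: {summary} -> faster: {faster[dataset]}")
```

Under this averaging, Oracle SQL comes out faster only for the video dataset, which agrees with the statement above.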
5 Conclusion and Future Scope
When processing small-scale data, Apache Hive takes more time than Oracle SQL. For large datasets, Apache Hive is very efficient at retrieving data, whereas Oracle SQL performs poorly. In Apache Hive, queries involving GROUP BY, ORDER BY, aggregate functions, etc. take more time than retrieving an entire dataset or a particular column. In Oracle SQL, retrieving an entire dataset or a column takes more time than retrieving with an aggregate function, COUNT, etc. In Apache Hive, the average time is high even when the number of rows is small, whereas in Oracle SQL the average time increases with the number of rows. In Fig. 8 below, we have plotted a line chart for a better understanding of the average time for the four datasets. The video dataset has few rows, so Apache Hive takes more time than Oracle SQL; the other three datasets have more rows, so Apache Hive takes less time than Oracle SQL. In the future, we will take large-scale (TB-sized) datasets and analyze both Apache Hive and Oracle SQL in a performance test.
Fig. 8. Overall processing time of Apache Hive and Oracle SQL
References
1. Chawda, R.K.: Big data and advanced analytics tools. In: Symposium on Colossal Data Analysis and Networking (CDAN) (2016)
2. Garg, V.: Optimization of multiple queries for big data with apache Hadoop/Hive. In: 2015 International Conference on Computational Intelligence and Communication Networks, pp. 938–941 (2015)
3. Gruenheid, A., Omiecinski, E., Mark, L.: Query optimization using column statistics in hive. In: Categories and Subject Descriptors (2016)
4. Haryono, G.P., Zhou, Y.: Profiling apache HIVE query from runtime logs. In: International Conference on Big Data Smart Computing BigComp, pp. 61–68 (2016)
5. Kaisler, S., Armour, F., Espinosa, J.A., Money, W.: Big data: issues and challenges moving forward. In: 2013 46th Hawaii International Conference on System Science, pp. 995–1004 (2013)
6. Sethy, R., Panda, M.: Big data analysis using hadoop: a survey. IJARCSSE 1153–1157 (2015)
7. Thusoo, A., Sen, S.J., Jain, N., Shao, Z., Chakka, P., Zhang, N., Antony, S., Liu, H., Murthy, R.: Hive - A petabyte scale data warehouse using Hadoop. In: Proceedings of the International Conference on Data Engineering, pp. 996–1005 (2010)
8. Loshin, D.: Big Data Tools and Techniques, pp. 61–72 (2013). Chapter 7
9. Hive Architecture. https://cwiki.apache.org/confluence/display/Hive/Design
10. Introduction to Oracle Database. https://docs.oracle.com/database/121/CNCPT/intro.htm#CNCPT001
11. Online Video Characteristics and Transcoding Time Dataset Data Set (2015). https://archive.ics.uci.edu/ml/datasets.html
12. Record Linkage Comparison Patterns Data Set (2011). https://archive.ics.uci.edu/ml/datasets.html
13. 3D Road Network (North Jutland, Denmark) Data Set (2013). https://archive.ics.uci.edu/ml/datasets.html
14. Rate Data Set (2015). https://www.kaggle.com/hhsgov/health-insurance-marketplace