Proceedings of 2013 IEEE Conference on Information and Communication Technologies (ICT 2013)
A Comparative Study of Enterprise and Open Source Big Data Analytical Tools

Udaigiri Chandrasekhar, Amareswar Reddy, Rohan Rath
School of Information Technology and Engineering, VIT University, Vellore, India
[email protected], [email protected], [email protected]
Abstract: In this paper, we present a comparative study of the leading enterprise big data analytical tools and their open source counterparts. The Transaction Processing Council (TPC) has established benchmarks for measuring the potential of software and its use; we apply similar benchmarks to the tools under discussion. We cover a wide range of big data analytics platforms and compare them on computing environment, the amount of data that can be processed, decision-making capabilities, ease of use, energy and time consumed, and pricing.

Keywords: Big data, enterprise, open source, analytical tools, Hadoop, business intelligence, metadata, MapReduce, SQL, security, reliability.
1 INTRODUCTION
We are in a flood of data today. Statistics show that 90% of the world's data was generated in the last two years alone, and it is growing exponentially. To tackle such data and process it, we need to leave traditional batch processing behind and adopt the new big data analytical tools. The data generated every day exceeds 2.5 quintillion bytes, which is a mind-boggling figure. The growth of data has affected all fields, whether the business sector or the world of science. To process such huge amounts of data, various new tools are being introduced by companies like Oracle and IBM, while on the other hand open source developers continue their work in the same field.
1.1 What is Big Data?
Big data refers to the massive volumes of data that accumulate over time and are difficult to analyze using traditional database tools. Big data includes business transactions, photos, surveillance videos and activity logs. Scientific data from sensors can reach massive proportions over time, and big data also includes unstructured text posted on the Web, such as blogs and social media.
1.2 Managing and Analyzing Big Data
For the past two decades, most business statistics have been produced from structured data generated by operational systems and consolidated into a data warehouse. Big data significantly increases both the number of data sources and the variety and volume of data that is useful for analysis. Much of this data is described as multi-structured to distinguish it from the structured operational data used to populate a data warehouse. In most companies, multi-structured data is growing faster than structured data. Two important data management approaches for handling big data are relational DBMS products optimized for analytical workloads (often known as analytic RDBMSs, or ADBMSs) and non-relational systems (sometimes known as NoSQL systems) for handling multi-structured data. A non-relational system can be used to produce analytics directly from big data, or to preprocess big data before it is loaded into a data warehouse.
1.3 Need for Big Data Analysis
When a business can make use of all the information available in its big data rather than just a subset of it, it gains a powerful advantage over its market competitors. Big data can help an organization gain insights and make better decisions. It provides an opportunity to create unmatched business benefits and better service delivery, but it also requires new infrastructure and a new way of thinking about how business and IT operate. The idea of big data is going to change the way we do things today. International Data Corporation (IDC) research forecasts that overall data will grow by 50 times by 2020, driven mainly by more embedded systems such as sensors in clothing, medical devices and structures such as buildings and bridges. The research also found that unstructured data - such as files, email and video - will account for 90% of
all data created over the next several years. But the number of IT professionals available to manage all that data will grow by only 1.5 times current levels. The digital universe is 1.8 billion gigabytes in size, stored in 500 quadrillion files, and its size more than doubles every two years. If we compare the digital universe with our physical universe, there are nearly as many bits of data in the digital universe as there are stars in the physical universe.
1.4 Characteristics of Big Data
A big data platform should provide a solution designed specifically with the needs of the enterprise in mind. The following are the basic features of a big data offering:
• Comprehensive - it should offer a broad foundation and address all three dimensions of the big data challenge: volume, variety and velocity.
• Enterprise-ready - it should include the performance, security, usability and reliability features that enterprises require.
• Integrated - it should simplify and speed up the deployment of big data technology in the business.
• Open source based - it should build on open source technology with enterprise-class performance and integration.
• Low latency reads and updates
• Robust and fault-tolerant
• Scalability
• Extensible
• Allows ad-hoc queries
• Minimal maintenance
2 ENTERPRISE BIG DATA ANALYTICAL TOOLS VERSUS OPEN SOURCE BIG DATA ANALYTICAL TOOLS
Ever since research on big data started, companies like IBM, Google and Oracle have been leading the race in enterprise analytical tools and have occupied the major share of the market. Little is known about the newer analytical tools that have been developed recently and are spreading faster than expected. In this paper, we bring to light the working characteristics of many such tools for big data. Alongside the expensive, closed source applications, we also strive to explain the working and advantages of open source analytical tools, which are as good as their counterparts in the enterprise world. We explain where each can be used and what issues are involved.
2.1 Enterprise Analytical Tools
2.1.1 Pentaho
Pentaho is a very useful tool for visualizing high volumes of data and analyzing them to draw conclusions. It provides the right set of tools for the entire big data processing lifecycle. It offers exploration and visualization features for business users and also performs predictive analysis, and it provides a 15-times boost compared to scripting and coding. It is very useful for interactive reports, time series forecasting, statistical learning, and the evaluation and visualization of predictive models, and it supports the Predictive Model Markup Language (PMML). It provides visual tools that give instant access to data. It is built on a modern, high-performing, lightweight platform that takes full advantage of 64-bit multi-core processors and harnesses the power of new-age hardware. Pentaho is unique in leveraging external data grid technologies such as Infinispan and Memcached to load vast amounts of data into memory. The Instaview feature in Pentaho enables us to instantly view the reports generated by careful analysis of the data in a multi-dimensional and interactive format.
Fig1. Instaview feature on Pentaho
The problem is with the pricing. During the survey, it was found that a pricing request needs to be sent to Pentaho Big Data Analytics in order to view a quotation. Reviews, however, show that it is priced considerably lower than other commercial BI tools.
2.1.2 TerraEchos
TerraEchos, a world leader in innovative security solutions using Big Data in Motion analytics, received the 2012 IBM Beacon Award for Outstanding Information Management Innovation. TerraEchos, a next-generation big data analytics company, has implemented the first platform to fuse and filter large volumes of live, complex structured and unstructured data on the fly, and at the same time to extract, analyze, and act upon the data in the moment. Unlike database-centric approaches to high-speed, high-volume data analysis, the TerraEchos Kairos platform is not restricted by the volume, type, or rate of the data: it continuously analyzes data while it is still in motion, without requiring it to be stored. Based on a streaming operating system, and with analytics and visualization modules tuned to specific
applications, the TerraEchos Kairos platform is well suited for security, intelligence, infrastructure, growing commercial markets, and other markets where understanding and acting right now is crucial.
Fig2. Annual lobbying by TerraEchos Inc.
2.1.3 Cognos
Cognos is known for business intelligence, financial performance and strategy management, and it also extends to analytics applications. IBM's Cognos software is known to provide an organization with whatever it needs to become top-performing and analytics driven. With offerings for the individual, workgroup, department, midsize company and large enterprise, Cognos has been developed to ensure that companies make better choices for their future and grow their business. This tool is for those who wish to improve their company's intelligence or performance.
Cognos solutions are designed to help you make smarter business decisions and manage performance for optimized results. They combine your financial and operational data into a single, seamless source of information in the environment of your choice for transparent and timely reporting, analysis and action. IBM offers a full range of customer support, training and certification, and services for Cognos and other Business Analytics customers.
Fig3. Query Analyser for Cognos
An investment of over a lakh in Indian currency is needed for the effective use of Cognos. It seems a legitimate investment for a company, but for personal usage it may not be affordable for the general public.
2.1.4 Attivio
The Active Intelligence Engine (AIE) of Attivio has had a huge impact on business by making customers' information assets available. It helps in quick analysis of the situation and helps users in whatever way they need, supporting whatever strategic goal has been established. AIE has the capability to analyze anything on many platforms and brings together both structured and unstructured data and content for efficient and agile BI. It can incorporate and relate all the available data and content silos without any advance data modelling. It has a very intuitive, Google-like search capability for BI which incorporates all the needed analytical tools. It offers both intuitive search capabilities and SQL power, making structured and unstructured data useful in ways never thought of before. It has great insight into the field of big data and can be useful for all kinds of users with varied technical skills and priorities.
2.1.5 Google BigQuery
Google BigQuery allows you to run SQL-like queries against very large datasets, with potentially billions of rows. This can be your own data, or data that someone else has shared with you. BigQuery works best for interactive analysis of very large datasets, typically using a small number of very large, append-only tables. It is a high-speed tool which can analyse billions of rows in seconds and can handle trillions of records amounting to terabytes of data. Its simplicity stands out, as its query language closely resembles SQL. It offers powerful sharing through groups and Google Accounts, works over a very secure SSL access method, and provides multiple access methods: the BigQuery browser tool, the REST API or Google Apps Script. It has a distinct pricing model that varies by storage ($0.12 per GB per month), interactive queries ($0.035 per GB processed) and batch queries ($0.02 per GB processed), each with default limits assigned.
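To make this pricing model concrete, consider a hypothetical workload (the volumes below are assumptions chosen only for the arithmetic, not figures from BigQuery documentation): storing 200 GB for one month and running interactive queries that scan 50 GB in total would cost roughly 200 GB x $0.12 + 50 GB x $0.035 = $24.00 + $1.75, or about $25.75 for the month. Running the same scans as batch queries instead would reduce the query portion to 50 GB x $0.02 = $1.00.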
Fig4. Workplace for Google BigQuery
2.1.6 Netezza
A revolutionary data analytical tool named Netezza was established by IBM. It is an advanced, high-performance warehouse engine that incorporates database, server and storage components into a single environment. It is used to run predictive analysis, business intelligence and many applications needed in every field.
It is based on IBM blade architecture which uses x86 processors, an Asymmetric Massively Parallel Processing (AMPP) approach to process workloads, and FPGAs (field programmable gate arrays, i.e. specialized processors) as a means to filter data before it is processed. The FPGA is used to speed up query processing and performs a preprocessing function: computational kernels are handled by the FPGAs rather than making the CPU do all of the work.
The Netezza design has led to exponential growth in the field of big data analytics and is very useful in a hybrid commodity or specialty environment. It is ideal for processing complex, scanning-type queries such as those found when performing deep analytics. It preprocesses data and then feeds the CPU in a balanced fashion. Further, it is not burdened by legacy database structures and online transaction processing features, resulting in a simple code path for faster performance. Netezza is a fine example of a system that has been designed and optimized for processing a specific workload: business analytics.
2.2 Open Source Analytical Tools
2.2.1 Apache Hadoop
Licensed under Apache v2, Apache Hadoop is an open source software framework that runs on a distributed platform and supports data-intensive processing. It supports running on large clusters of commodity hardware, provides data mobility, and offers security and reliability for data processing. It has its own computation paradigm called MapReduce, in which the work is divided into units and then processed on a clustered system or a grid. It has the capacity to handle petabytes of data in millions of files within seconds and provides very high bandwidth for computational processing.
Cloudera is the leader in Apache Hadoop-based software and services and offers a powerful data platform that enables enterprises and organizations to look at all their data, structured as well as unstructured.
Hadoop is written in the Java programming language and is a top-level Apache project built and used by a global community of contributors.
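To illustrate the MapReduce paradigm described above, the following is a minimal word-count sketch using the standard Hadoop Java API. The class names (WordCount, TokenizerMapper, IntSumReducer) and the input/output paths are illustrative choices made here, not something prescribed by the tools under discussion.

import java.io.IOException;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class WordCount {

  // Map step: emit (word, 1) for every token in the input split.
  public static class TokenizerMapper extends Mapper<Object, Text, Text, IntWritable> {
    private static final IntWritable ONE = new IntWritable(1);
    private final Text word = new Text();

    @Override
    protected void map(Object key, Text value, Context context)
        throws IOException, InterruptedException {
      for (String token : value.toString().split("\\s+")) {
        if (!token.isEmpty()) {
          word.set(token);
          context.write(word, ONE);
        }
      }
    }
  }

  // Reduce step: sum the counts emitted for each word across the cluster.
  public static class IntSumReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
    @Override
    protected void reduce(Text key, Iterable<IntWritable> values, Context context)
        throws IOException, InterruptedException {
      int sum = 0;
      for (IntWritable v : values) {
        sum += v.get();
      }
      context.write(key, new IntWritable(sum));
    }
  }

  public static void main(String[] args) throws Exception {
    Job job = Job.getInstance(new Configuration(), "word count");
    job.setJarByClass(WordCount.class);
    job.setMapperClass(TokenizerMapper.class);
    job.setCombinerClass(IntSumReducer.class);   // combiner cuts network traffic between map and reduce
    job.setReducerClass(IntSumReducer.class);
    job.setOutputKeyClass(Text.class);
    job.setOutputValueClass(IntWritable.class);
    FileInputFormat.addInputPath(job, new Path(args[0]));    // e.g. an HDFS input directory
    FileOutputFormat.setOutputPath(job, new Path(args[1]));  // output directory must not already exist
    System.exit(job.waitForCompletion(true) ? 0 : 1);
  }
}

Packaged as a JAR and submitted with the hadoop jar command, such a job lets the framework handle input splitting, task scheduling across the cluster, and re-execution of failed tasks, which is what gives the paradigm its scalability and fault tolerance.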
Fig5. Work flow of Apache Hadoop
2.2.2 Zettaset
One of the most flexible open source analytical tools, Zettaset works on any Apache Hadoop distribution. Its features include availability and security of the data. It is easy to deploy on any system and is very cost effective; simplicity is Zettaset's second name.
Zettaset has created Orchestrator, a software solution and enterprise management tool that addresses the common issues of Hadoop deployment with easy-to-use interfaces backed by sophisticated tooling. It has the capacity to automate, simplify and accelerate the installation of Hadoop on a cluster management system that is ready for enterprise usage. The outstanding feature of Orchestrator is that it is not tied to any particular Hadoop distribution but is a very secure open system.
2.2.3 HPCC Systems
HPCC (High Performance Computing Cluster), as the name suggests, is a clustered computing system developed by LexisNexis Risk Solutions. There are two versions of this tool, paid as well as free, and both work on structured and unstructured data. It scales from one to thousands of nodes, and parallel processing makes it an even stronger tool. It is commercially available and offers a lot of features for a tool so easily available to the masses.
A major advantage of selecting this tool is that its platform for data-intensive computing includes a highly integrated system environment with capabilities ranging from raw data processing to high-performance queries and data analysis using a common language. Working as an optimized cluster, it is a low-cost, high-performing system, resulting in a very low TCO (Total Cost of Ownership) along with security, scalability, reliability and very good processing speed. It incorporates an innovative data-centric programming language which increases programmer productivity for developing applications on this platform. It has good fault tolerance and capabilities for reprocessing in case of system failures. It can also handle data warehouses and high-volume online applications, as well as network security analysis of massive amounts of log information.
2.2.4 Dremel
An interactive ad-hoc query system, Dremel was developed by Google to offer analysis of read-only nested data. Scalable to extremes, up to thousands of machines and petabytes of data, it has the power to process trillions of rows in a matter of seconds by combining multi-level execution trees with a columnar data layout. It is not meant to replace the older MapReduce (MR) system; it is often used for analysis of crawled web documents, tracking install data for applications on Android Market, and crash reporting for Google products. Other important uses are spam analysis, debugging of map tiles on Google Maps, and tablet migrations in managed Bigtable instances.
2.2.5 Greenplum HD
Greenplum HD allows customers to get started with big data analytics without the need to build an entirely new infrastructure. It is provided as software or can be used in a pre-configured Data Computing Appliance module. Greenplum HD is a 100 percent open-source certified and supported edition of the Apache Hadoop stack that contains HDFS, MapReduce, Hive, Pig, HBase and ZooKeeper. It provides a complete data analysis platform and brings together Hadoop and the Greenplum database in a single Data Computing Appliance. Available as software or in a pre-configured Data Computing Appliance module, Greenplum HD provides a complete platform, including installation, training, global support, and value-add beyond simple packaging of the Apache Hadoop distribution. Greenplum HD makes Hadoop faster, more reliable, and easier to use.
Greenplum HD supports Isilon's OneFS scale-out NAS storage for Hadoop. EMC Isilon scale-out NAS is the first and only enterprise NAS solution that can natively integrate with the Hadoop Distributed File System (HDFS) layer. By treating HDFS as an over-the-wire protocol, you can quickly set up a comprehensive big data analytics solution that combines Greenplum HD with Isilon scale-out NAS storage systems to provide a highly efficient and versatile data storage and analytics environment. The Greenplum HD DCA module integrates the Greenplum HD software into an appliance, offering an optimized configuration designed for performance and stability. The Greenplum Data Computing Appliance marries the unstructured batch-processing power of Hadoop with the Greenplum Database and its Massively Parallel Processing (MPP) architecture. This allows businesses to extract value from both structured and unstructured data on a single, seamless platform.
2.2.6 HortonWorks
Hortonworks provides an authentic, free Hadoop distribution. It is developed on top of Hadoop and allows clients to capture, process and share data of any variety, in any format, in a simple and cost-effective way. Apache Hadoop is the key component of the Hortonworks platform. It is ideal for organizations that want to combine the power and cost-effectiveness of Apache Hadoop with the advanced services and stability required for enterprise deployments. Hortonworks is the newest vendor of Hadoop distributions and expert support, but it is an old hand when it comes to working with the platform: the company is a 2011 spinoff of Yahoo, which remains one of the biggest users of Hadoop and where much of Hadoop was originally developed, and Hortonworks retained nearly 50 of Hadoop's earliest and best-known contributors. The differentiator between Hortonworks and the other vendors is that Hortonworks products are 100% open source and free, in contrast to some of Cloudera's proprietary and/or value-adding Hadoop products, which are not.
2.2.7 ParAccel
The ParAccel data analytics platform has been used by organizations for its interactive capabilities to analyze big data in an enhanced fashion. It offers high storage capacity along with adaptive compression capabilities. In-memory processing and on-the-fly compilation are also important features, making it easy to work with and adapt to.
Fig6. ParAccel Query Analyzer
2.2.8 GridGain
An enterprise open source system, GridGain, as the name suggests, is for grid computing. It was made specifically for Java, is compatible with Hadoop's distributed file system, and offers an alternative to Hadoop's MapReduce. It provides a distributed, in-memory and scalable data grid, which acts as the link between data sources and different applications. An open source version is available on GitHub, and a commercial version can be downloaded from their homepage.
3 COMPARISON OF SECURITY BETWEEN OPEN SOURCE AND ENTERPRISE TOOLS
Although you can take an open source project, compare it against a closed source project, and say that one is more secure than the other based on some number of observations or measurements, this determination will probably rest on factors other than the nature of the project's open or closed source code. Secure design, source code auditing, quality developers, design process, and other factors all play into the security of a project, and none of these is directly related to a project being open or closed source. It is shocking to see the vulnerabilities in some closed source systems. This certainly does not mean that open source software is quantitatively "more secure" than closed source software; it just means that there is doubt about the source code auditing principles and the general security practices of certain closed source operating system vendors. However, the issue here is not specifically related to the operating system being open or closed source, but to the processes with which the vendor approaches security. Although this might seem to imply that open source projects are going to have fewer vulnerabilities than closed source projects, that is not really the case either; the number of vulnerabilities present in a given system cannot simply be associated with the openness of its source code. Ultimately, it is about the way the project and its developers handle and integrate security.
4 SELECTING THE RIGHT TOOLS FOR DATA ANALYTICS
The factors discussed in this paper have a significant impact on technology selection. Organizations are not ready to make risky investments in expensive solutions when there is still more to be discovered. This is where complementary alternatives come into play. Existing proprietary, and generally expensive, storage and database solutions are being supplemented by some of the more cost-effective emerging technologies, generally from the Apache Hadoop ecosystem. Initial discovery and exploration of large data volumes, where the "nuggets" are well hidden, can be performed in a Hadoop environment. Once the "nuggets" have been discovered and extracted, a reduced and more structured data set can then be fed into an existing data warehouse or analytics system. From that viewpoint, it makes sense for providers of existing storage, database, and data warehousing and analytics software to provide connectors and APIs to Hadoop solutions, and to put together integrated offerings that work with both the proprietary and free components. While some of them rush to embrace Hadoop, there is no evidence that this is always a sensible and suitable move. As already described, many of the new big data technologies are not ready for mainstream business use, and organizations without the IT capabilities of the trailblazers or
common early adopters will welcome the support from recognized providers.
5 CONCLUSION
To conclude, after analyzing both closed and open source big data tools, it is evident that the choice comes down to the usage and needs of the individual or the company. Some tools are impossible to afford at a personal level because of their prices and complexity, while open source systems may pose problems of obsolescence and ongoing modification. There are also security issues involved in choosing a tool. Open source promotes development and innovation and supports developers. Big data is on every CIO's mind, and for good reason: companies spent more than $4 billion on big data technologies in 2012. These investments will in turn trigger a domino effect of upgrades and new initiatives valued at $34 billion for 2013.