Hadoop as Big Data Operating System – The Emerging Approach for Managing Challenges of Enterprise Big Data Platform Sourav Mazumdar IBM Software Group San Jose, CA, USA
Subhankar Dhar San Jose State University San Jose, CA, USA
[email protected]
[email protected]
Abstract— Over the last few years, innovation in Hadoop and other related Big Data technologies has brought a lot of promise around better management of enterprise data at much lower cost and with high-value business benefits. In this paper, we delve into the details of the challenges of managing such data from a practitioner's perspective, based on lessons learnt from various Big Data implementation scenarios. We also discuss the emerging concept of Hadoop as Big Data Operating System, which addresses these challenges with a holistic proposition. Finally, we provide a prescriptive approach, based on best practices, that can help organizations move towards the vision of an Enterprise Big Data Platform using Hadoop as Data Operating System while balancing the short term objectives and long term goals of managing and maintaining such a platform.

Keywords— Business analytics, Big Data, Data Mining, Map Reduce, Hadoop, NoSQL

I. INTRODUCTION
Innovation in Hadoop and other related Big Data technologies has so far created a lot of promise around better management of enterprise data at much lower cost, and around the use of that data for high-value business benefits. Over the past few years a growing number of organizations have actually started using Hadoop and other Big Data technologies to implement a variety of enterprise grade applications. However, managing a Big Data environment at enterprise level can be an involved task from both cost and operational perspectives. Supporting various types of enterprise data use cases with different workload patterns in the same cluster, minimizing data movement in and out of the cluster, assuring different SLAs across use cases and user groups, and ensuring data lineage and veracity, security and data privacy are some of the key challenges every enterprise is trying to address at its individual stage of establishing an Enterprise Big Data Platform.
The enterprise IT practitioners who are in charge of enterprise data are currently trying to figure out how to create and maintain a scalable and sustainable Enterprise Big Data Platform that reaps the promised benefits of Big Data in a predictable and repeatable manner.

II. WHAT IS ENTERPRISE BIG DATA PLATFORM
Adoption of a disruptive technology like Big Data in an organization is not just about the applicability of the technology to current and future use cases but also about growing a complete ecosystem around it. Along with the identification and application of the right technology choices, it also means nurturing a new set of skills, modifying/enhancing processes and tools around the usage of data, managing multiple hardware infrastructures, developing partnerships with the right set of technology vendors (and system integrators) and, most importantly, aligning and on-boarding different stakeholders within the organization to sponsor and support the new set of initiatives and projects. In the case of Big Data technologies, all of these together can be summarized as the Enterprise Big Data Platform [11,12,15].

III. TOP FIVE CHALLENGES INVOLVED IN ENTERPRISE BIG DATA PLATFORM
Adoption of any new technology brings challenges to an organization arising from maturity, fear of the unknown and the classic 'resistance to change'. Big Data technologies have come a long way in the last few years in addressing many of these challenges successfully, which influenced a steep jump in the number of Big Data adoptions across organizations in 2013 [1,5]. However, with more adoptions, people have started finding new sets of challenges as they envision and implement new types of use cases that stretch the capabilities of Big Data technologies.
Figure 1: A high level view of Enterprise Big Data platform covering different aspects of the same

Here is the list of the top five challenges as experienced today by different organizations at various junctures of their Big Data journey [2, 3, 9, 11].
1. Support for various types of use cases with different SLAs for performance and throughput – Owing to the groundbreaking success of running Batch use cases and Online non-transactional/non-relational CRUD use cases using Big Data technologies (also known as NoSQL use cases), people have also started considering Big Data technologies for other use cases. These can be broadly categorized as a) Interactive and Real Time (in seconds) queries [16], b) Real Time Streaming Data processing [14], c) Search [11] and d) Machine Learning & Natural Language Processing. As a real life example of this need, consider the Big Data requirements around Telematics data in the Auto Insurance space. Many auto insurance providers today want to run high frequency Telematics data ingestion, cleanse that data, use it for modeling insurance quotes, and show aggregated Telematics data as driving behavior to the end customer, all in one Big Data infrastructure. A similar desire exists in other industries that deal with high frequency data (like Utilities with AMI data, Retail with RFID data, etc.).
2. Integration with existing tools and technologies in the Enterprise – Most Big Data technologies were invented in Web based product companies to solve focused problems related to handling data at large scale. So they often suffer from a lack of integration with other software products used in a typical enterprise environment. The major needs are around a) integration with reporting applications using a SQL interface over ODBC/JDBC [14, 17, 19], b) integration with ETL software, which helps in data ingestion from various sources [14, 17], and c) integration with tools providing Statistical and Predictive modeling capability [12, 18, 19]. Corporations of different sizes in the Insurance, Retail and Financial sectors already have significant investments in ETL tools (like Informatica, DataStage), Modeling tools (like SAS, SPSS, R) and Reporting tools (like Business Objects, Cognos, OBIEE, etc.). In our experience with their requirements in the Big Data space, they surely want easy integration of those tools with Big Data technologies.
3. Skills – One of the biggest premises of Big Data technologies is 'one size does not fit all'. But at the same time this makes Big Data technologies less adoptable. So many new technologies and promises emerge every day in the Big Data space that it is hard to keep track of them [1,5], not to mention the dearth of affordable skilled resources in the market. Reuse of existing enterprise skills is also of big importance. Most traditional IT shops with pressing needs for Big Data (like Finance, Insurance, Retail, Media, etc.) have typically run their Data Warehouse/Business Intelligence operations with skills like SQL, Java and report design, which are comparatively easy to find in the market. Big Data technologies need to provide reasonable support for these types of in house skills.
4. Data Security, Privacy and Governance – This is of paramount importance for any data use case. However, ensuring it for big volumes of data, and especially for non-structured data, is not trivial. What makes this a unique challenge in a Big Data Platform is that the ownership and visibility requirements of the same data may often vary across types of users and use cases [9,11]. The Financial and Insurance industries are the most particular about these requirements; however, other industries also need to ensure them to a reasonable extent for legal reasons.
5. Consolidated Infrastructure – Big Data technologies often need separate infrastructure/clusters to run in production. This not only creates unwanted data movement across clusters, which can cause data staleness, but also creates management and operational challenges. It can also leave clusters underutilized, yielding lower ROI. So having a consolidated production infrastructure that maintains the required SLAs and priorities across disparate types of use cases is key [11, 13, 14]. Many Retail, Consumer Electronics, Media and Gaming companies today run with petabytes of data, as their business thrives on a high volume of customers and a high number of activities performed by those customers. For these companies, running a number of separate Big Data clusters in production is next to impossible.
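To make the consolidated-infrastructure challenge concrete, the following sketch (not from the paper; queue names and weights are invented) illustrates the weighted-share idea that resource managers such as YARN's Capacity Scheduler use to let batch, interactive and streaming workloads coexist on one cluster with predictable shares:

```python
# Illustrative sketch: divide total cluster capacity across workload queues
# in proportion to configured weights, so each use case type gets a
# predictable share. Queue names and weights here are hypothetical.

def allocate_shares(total_vcores, queue_weights):
    """Split cluster capacity across queues in proportion to their weights."""
    total_weight = sum(queue_weights.values())
    return {q: total_vcores * w / total_weight
            for q, w in queue_weights.items()}

queues = {"batch-etl": 5, "interactive-sql": 3, "streaming": 2}
shares = allocate_shares(1000, queues)
# With 1000 vcores and weights 5:3:2, batch-etl receives half the cluster,
# interactive-sql 30% and streaming 20%.
```

Real schedulers add preemption, minimum guarantees and elasticity on top of this basic proportional split, but the core contract, a guaranteed fraction of the cluster per workload class, is what makes consolidation viable.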
Figure 2: A high level view of the top 5 challenges involved in creating an Enterprise Data Platform

IV. THE VISION OF HADOOP AS BIG DATA OPERATING SYSTEM
In the past few years, the work happening in the Hadoop ecosystem, in open source as well as in licensed distributions, has shown significant promise to address these requirements. The vision is for the Hadoop ecosystem to go beyond the Map Reduce processing paradigm and allow other programming paradigms to operate on the same HDFS cluster, addressing the above mentioned requirements in a scalable manner [11,14,16]. This makes Hadoop as Big Data Operating System a natural choice for becoming the heart of the Enterprise Big Data Platform. Here is the layered view of the vision of Hadoop as Big Data Operating System, abstracting the key separation of concerns and thereby ensuring the scalability, flexibility and interoperability needed in an Enterprise Big Data Platform [3,4,6,7,8].
1. Low Cost Hardware Cluster Layer – This layer is fundamental to the Enterprise Big Data Platform, promoting a low cost single cluster vision for storing and processing all Big Data of an enterprise.
2. Distributed File System Layer – This layer provides the foundation at the file system level, which is scalable for any size of file, flexible for any type of data stored in the file, and reliable, with built in high availability constructs, for any volume of data. Many choices are available today in the industry for its implementation – be it the open source Apache HDFS implementation, IBM's GPFS-FPO, Intel's Lustre, MapR's distribution or Brisk, each with different value added features. All of them comply with the common HDFS API, ensuring that they provide the basic Quality of Service guaranteed by HDFS contracts, which makes them easy to integrate with.
3. Resource Management Layer – This layer ensures resource management across multiple types of use cases, so that each type of use case gets its own share of resources while honoring priorities and SLAs. Here again, different options are available today with different levels of maturity – IBM Platform Symphony, YARN, Mesos, etc.
4. The Distributed Processing Layer – This layer welcomes processing paradigms other than Map Reduce to operate on the HDFS platform. This addresses the need to run non Map Reduce programs in the same HDFS cluster while reaping the core benefits of HDFS, like scalability, reliability and flexibility for structure agnostic data. The industry has already seen many processing engines emerge to support this, be it a generic engine like Spark or Tez, or an engine for a specific solution like Hoya for HBase, the IBM BigSQL engine, the Cloudera Impala engine, etc. This layer also helps address Data Privacy and Security, as every application can now develop its own access control mechanism.
5. Component Layer – This layer aligns to the different Enterprise Big Data use cases, bridging between the Distributed Processing Layer and the API Layer. The components in this layer can be classified broadly under a) OLAP Relational use cases – Batch, Interactive, Real Time; b) Online non-Relational use cases/Graph – popular as NoSQL solutions; c) Real Time Stream Data Processing – Complex Event Processing on streaming data; d) Machine Learning and Natural Language Processing – supervised and unsupervised modeling of quantitative as well as qualitative data; and finally e) Search use cases – discovery and relating of various structured and unstructured data in the enterprise. Today the industry offers plenty of solutions, at various phases of maturity, for each of these use cases, which run (or can potentially run) on the HDFS platform, with many more to come. This is the layer for innovation.
6. API Layer – This layer is about consolidating the access APIs for the components running on the various Distributed Processing Engines. The three major APIs, which are already emerging as the de facto choices aligned with typical enterprise skills, are SQL, Java and R. SQL has been the de facto standard in the enterprise data world for the last few decades, so most of the initiatives in the industry today (IBM's BigSQL, Cloudera Impala, Hive 1.3, Hive on Spark, etc.) are trying to make SQL support as strong as possible. R is the Machine Learning/Statistical Modeling interface that has emerged in a big way in the last few years in industry as well as academia; many Hadoop frameworks (IBM's BigR, SparkR, etc.) are working towards integrating R efficiently with Hadoop, and Python is also picking up in the same space. Java has already become the de facto programming language for most enterprise applications; in the Big Data space it can be used for Streaming, Machine Learning, Natural Language Processing, Search, etc. – the use cases which need more flexibility for specific applications.
7. The Common Service Layer – This layer encapsulates the other common services needed in any enterprise grade data use case – Meta Data management, Security, Workflow, Data Mapping & Lineage, etc. Solutions that can work on Hadoop in these areas are emerging in the industry today.

The Distributed Processing Engine(s) used by each component in Figure 3 are indicated as circles of the same color as the corresponding Distributed Processing Engine. Table 1 summarizes the layers of Hadoop as Enterprise Big Data Operating System and the requirements of the Enterprise Big Data Platform addressed by each.
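For readers less familiar with the Map Reduce paradigm that the Distributed Processing Layer generalizes beyond, the following self-contained sketch shows its three conceptual phases on a toy word count. In real Hadoop jobs these phases are distributed across the cluster by the framework; this single-process version is purely illustrative:

```python
# Illustrative single-process sketch of the Map Reduce paradigm:
# map emits (key, 1) pairs, shuffle groups pairs by key, and reduce
# aggregates each group. Input data is invented for the example.
from collections import defaultdict

def map_phase(lines):
    # Map: emit a (word, 1) pair for every word in the input.
    for line in lines:
        for word in line.split():
            yield (word.lower(), 1)

def shuffle(pairs):
    # Shuffle: group all values emitted for the same key together.
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups

def reduce_phase(groups):
    # Reduce: aggregate the grouped values per key.
    return {key: sum(values) for key, values in groups.items()}

lines = ["big data big platform", "hadoop platform"]
counts = reduce_phase(shuffle(map_phase(lines)))
# counts == {"big": 2, "data": 1, "platform": 2, "hadoop": 1}
```

Engines like Spark and Tez keep this same functional decomposition but relax its rigid two-stage structure into general dataflow graphs, which is precisely what lets them share the HDFS cluster while serving interactive and iterative workloads.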
Figure 3: Different layers of Hadoop as Big Data Operating System

Table 1: Hadoop as Big Data Operating System and the Requirements Addressed
Figure 4: Moving towards Enterprise Big Data Platform using Hadoop as Big Data Operating System

V. HOW TO MOVE TOWARDS THE VISION OF HADOOP AS ENTERPRISE BIG DATA OPERATING PLATFORM

As discussed in the previous section, the vision of Hadoop as Big Data Operating System comprises various components in different layers. However, they are at different levels of maturity, ranging from the conception stage to actual use in production. Nevertheless, the potential of using it as the heart of the Enterprise Big Data Platform is very compelling, and the level of energy and investment observed today in the industry (in open source communities, licensed product vendors, and user communities like various enterprises) for moving towards it is very promising [4,6,7,8].

In our previous paper we discussed three best practices for enterprise adoption of Big Data [5]. The same hold true for moving towards the use of Hadoop as Big Data Operating System for the Enterprise Big Data Platform. However, in addition to following those best practices, there is also a need to adopt a strategic approach for execution which can protect the immediate investments in Big Data projects while assuring a march towards the long term goal. We recommend a strategic approach as summarized in Figure 4. The participation needed from various stakeholders in each step is depicted as circles in each step, with the corresponding colors provided in the legend.

The key steps involved in this approach are –
1. Big Data Stampede Project – This is a kick-off project an enterprise can execute along with a Big Data product vendor and key internal stakeholders (like the IT Development Team, IT Infrastructure Team, Data Science Team, etc.) to jump start the Big Data journey (or align the existing one). This project should have a good combination of business use cases and technical use cases proving the readiness and viability of Big Data technology in the enterprise context. It can provide the necessary training, experience and confidence to all internal stakeholders, as well as necessary inputs towards the success of the first (and critical)/pilot Big Data project in the enterprise. It can also help identify new use cases for the enterprise which were otherwise not thought of using traditional data technologies.
2. Ongoing Technology Research Projects – The enterprise needs to keep doing ongoing technology research projects to keep track of the maturity of the Big Data components and APIs applicable to its situation. This will also help prioritize the use of components and APIs for production projects. Having the product vendor as part of this team is important to get the necessary commitment on the availability (and maturity) of the required Big Data technologies.
3. Ongoing Data Science Exploration Projects – These small scale projects should be geared towards the identification of new business opportunities and values ('Data Products') using the data in the Enterprise Big Data Platform, as well as towards addressing existing business needs. The outputs/learnings from these projects can be used in sequencing Big Data projects in production.
4. Big Data Environment Preparation Projects – Big Data environment preparation is typically iterative and evolving work, as it requires learning new Big Data technologies that impact the Big Data infrastructure and also procuring hardware/cloud infrastructure, which typically needs time. Inputs (on business need) from business critical requirements need to go to the infrastructure project and, in the opposite direction, inputs (on readiness) from the infrastructure preparation project have to inform the feature prioritization of business critical projects going into production. The balance between these two has to be worked out carefully.
5. Production Project Prioritization – This too is an iterative exercise, which has to be performed on an ongoing basis so as to balance the most mature technology against the most effective technology for a given set of business value use cases. The Technology Stampede Project track, Research track, Data Science Exploration track and Infrastructure track provide the necessary inputs.
6. Ongoing Production Projects – These are the actual production projects to be executed in series or in parallel. The first/pilot project is very critical to establish confidence in viability. Learning from these projects needs to be assimilated into all the other tracks.
Various organizations across different industries like Insurance, Media and Gaming, Consumer Electronics Manufacturing, Retail, etc. have been adopting this approach in some form or fashion. Formalizing (and customizing) it in the structured manner detailed above would be of immense help in reaping the required benefits of Big Data technologies.

CONCLUSIONS

In the last few years, with the new set of Big Data tools, technologies and infrastructure available at our disposal, it has become much easier to capture, store and analyze all types of data in the enterprise – structured, semi-structured and unstructured. This has helped Big Data technologies gain considerable attention, not only due to their potential to transform data mining/business analytics practices but also because of their ability to drive a wide range of highly effective new generation use cases which can help create new business opportunities. This makes the future of Big Data technologies look very promising. However, enterprise adoption of Big Data is still evolving. Several new requirements (and hence challenges) are coming up as Big Data technologies see more and more adoption in enterprises, and these need to be addressed before Big Data technologies attain their full potential. In this paper, we have discussed the potential of using Hadoop as Big Data Operating System as the heart of the Enterprise Big Data Platform. The key characteristics of the various layers of Hadoop as Big Data Operating System were discussed. We have also proposed a methodology for moving towards the Enterprise Big Data Platform keeping Hadoop as Big Data Operating System at its heart. This can provide guiding principles for researchers, practitioners and organizations that are looking for solutions in this area. We hope that this work will provide key insights into the design and implementation of Enterprise Big Data Platforms.

REFERENCES
[1]
Christian, B., Boncz, P., Brodie, M.L., and Erling, O., "The Meaningful Use of Big Data: Four Perspectives – Four Challenges," Proceedings of the 2011 STI Semantic Summit, Riga, Latvia, 2011.
[2] E. A. Brewer, "Towards Robust Distributed Systems," http://www.cs.berkeley.edu/~brewer/cs262b-2004/PODC-keynote.pdf
[3] IBM BigInsights, http://www03.ibm.com/software/products/en/infobigienteedit
[4] Cloudera, Cloudera Manager Enterprise Edition 4.5.x Release Notes, 2013, http://www.cloudera.com/content/cloudera-content/clouderadocs/CM4Ent/4.5.1/Cloudera-Manager-Enterprise-Edition-4.5.xRelease-Notes/Cloudera-Manager-Enterprise-Edition-4.html
[5] Dhar, S. and Mazumder, S., "Challenges and Best Practices for Enterprise Adoption of Big Data Technologies," Journal of Information Technology Management, Vol. 25, No. 4, December 2014.
[6] Foley, M. and Shah, H., "Deploying and Managing Hadoop Clusters with Ambari," Hadoop Summit, 2012.
[7] C. M. Saracco, D. Kikuchi, and T. Friedrich, "Developing, publishing, and deploying your first Big Data application with InfoSphere BigInsights," http://www.ibm.com/developerworks/data/library/techarticle/dm1209bigdatabiginsights/index.html?ca=dat
[8] J. Gantz and D. Reinsel, "The Digital Universe in 2020: Big Data, Bigger Digital Shadows, and Biggest Growth in the Far East," http://idcdocserv.com/1414
[9] Stonebraker, M., "Big Data, Big Problems," Communications of the ACM, Vol. 55, Issue 2, 2011.
[10] Harris, H., "What is a 'Data Product'?," March 31, 2014, http://meetings2.informs.org/wordpress/analytics2014/2014/03/31/whatis-a-data-product
[11] Demchenko, Y., de Laat, C., and Membrey, P., "Defining architecture components of the Big Data Ecosystem," International Conference on Collaboration Technologies and Systems (CTS), 2014, pp. 104-112.
[12] Vera-Baquero, A., Colomo-Palacios, R., and Molloy, O., "Business Process Analytics Using a Big Data Approach," IT Professional, Vol. 15, Issue 6, 2013, pp. 29-35.
[13] R. Casado and M. Younas, "Emerging trends and technologies in big data processing," Concurrency and Computation: Practice and Experience, 2014.
[14] W. K. Lai and Y.-U. Chen, "Towards a framework for large-scale multimedia data: storage and processing on Hadoop platform," The Journal of Supercomputing, Vol. 68, pp. 488-507, 2014.
[15] W. Luo, "Enterprise Data Economy: A Hadoop-Driven Model and Strategy," International Conference on Big Data, 2013.
[16] A. Ghazal, T. Rabl, M. Hu, F. Raab, M. Poess, A. Crolotte, and H. Jacobsen, "BigBench: towards an industry standard benchmark for big data analytics," Proceedings of the ACM SIGMOD International Conference on Management of Data, 2013, pp. 1197-1208.
[17] Kala Karun, A. and Chitharanjan, K., "A Review on Hadoop – HDFS Infrastructure Extensions," Proceedings of the IEEE Conference on Information and Communication Technologies (ICT), 2013.
[18] S. Das, Y. Sismanis, K. S. Beyer, R. Gemulla, P. J. Haas, and J. McPherson, "Ricardo: Integrating R and Hadoop," Proceedings of the ACM SIGMOD International Conference on Management of Data, 2010, pp. 987-998.
[19] F. Özcan, D. Hoa, K. S. Beyer, A. Balmin, C. J. Liu, and Y. Li, "Emerging Trends in the Enterprise Data Analytics: Connecting Hadoop and DB2 Warehouse," Proceedings of the ACM SIGMOD International Conference on Management of Data, 2011, pp. 1161-1164.