Document P41 March 2015

RESEARCH NOTE

HADOOP BEST PRACTICES: SINGLE VERSUS MULTIPLE INSTANCES

THE BOTTOM LINE

Customers in business intelligence and analytics are turning to data management platforms to handle expanding data sets, preventing bottlenecks in accessibility while maintaining high standards of data integrity. Nucleus has found that customers are strategizing how to cost-effectively use data management resources, such as Hadoop, to eliminate bottlenecks and reduce dependence on set data locations. In looking at customer experiences with big data, Nucleus found that customers achieved greater value from deploying and scaling Hadoop in one instance with all their data sets rather than deploying individual, disparate instances for each data set.

ANALYSTS: Rebecca Wettemann, John Drotar

THE SOLUTION

Hadoop is an open-source framework for distributed data storage and processing that provides resources for customers to manage data across multiple data storage types and locations. Hadoop connects with big data storage platforms, linking them together and running analytics to guide relevant data to the processes that need it. The framework reduces the impact of location outages on data accessibility, helping ensure that customers have access to vital data sets at all times. Hadoop consists of:

- Hadoop Distributed File System (HDFS). HDFS divides files and data sets into blocks and distributes them across data nodes grouped into clusters (a minimal client sketch in Java follows this list).

- Hadoop Map/Reduce. Map/Reduce lets customers process data by writing map and reduce functions that are sent to the individual nodes holding the needed data, so the coded activity runs where the data lives (a word-count sketch appears below).

- Hadoop Ecosystem. Additional software packages can be used on top of or alongside Hadoop, such as Apache Hive for SQL-style analysis of big data on HDFS and compatible file systems, Apache Pig for creating Map/Reduce programs, and Apache Spark for open-source, in-memory analytics on top of HDFS that typically runs faster than Map/Reduce.

- Yet Another Resource Negotiator (YARN). YARN extends Map/Reduce's resource management capabilities to other analytics engines and streamlines how Map/Reduce processes data. YARN also lets users run multiple applications in Hadoop that share the same resource management and memory while handling different workloads.
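
To make the HDFS and clustering concepts above concrete, the following is a minimal Java client sketch that copies a local file into HDFS and raises its replication factor. The file paths are hypothetical placeholders, and the sketch assumes the cluster address comes from a standard core-site.xml on the classpath; it is an illustration, not a reproduction of any customer deployment.

import java.io.IOException;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class HdfsPutExample {
  public static void main(String[] args) throws IOException {
    // Reads fs.defaultFS and related settings from core-site.xml on the classpath.
    Configuration conf = new Configuration();
    FileSystem fs = FileSystem.get(conf);

    // Copy a local file into HDFS; the NameNode splits it into blocks and
    // the DataNodes store replicated copies of each block across the cluster.
    Path local = new Path("/tmp/sales.csv");    // hypothetical local file
    Path remote = new Path("/data/sales.csv");  // hypothetical HDFS destination
    fs.copyFromLocalFile(local, remote);

    // Raise the replication factor so the file tolerates more node failures,
    // which is what reduces dependence on any single storage location.
    fs.setReplication(remote, (short) 3);

    FileStatus status = fs.getFileStatus(remote);
    System.out.println("block size=" + status.getBlockSize()
        + " replication=" + status.getReplication());

    fs.close();
  }
}

Replication is the mechanism behind the outage resilience described below: with three copies of each block spread across different data nodes, the loss of any one node leaves the data accessible.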

Companies are facing challenges from the increasing size and scale of data sets as more workplace processes are automated and generate data. Hadoop lets customers understand their data sources and manage accessibility across multiple locations. The clustering of data nodes within a data source also reduces dependence on any one data storage location, reducing the risk of hardware failures as well as the amount of time data is inaccessible in the event of an outage.

End users can also code analytical queries directly against their data sets and map routes for analytics engines to reach relevant data without complex searches. This mapping and low-level analytics reduce the big data bottlenecks caused by multiple analytical systems engaging the data at once, because searches are performed after the relevant data sets have been extracted. To better understand the strategies for maximizing returns from Hadoop, Nucleus analysts conducted in-depth interviews with organizations with big data strategies.
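
As a concrete illustration of the Map/Reduce model referenced above, the following is a minimal word-count job in the style of the standard Hadoop tutorial example: the mapper emits a (word, 1) pair for each token on the node holding that block of input, and the reducer sums the counts per word. The input and output paths are command-line placeholders, not paths from any customer environment.

import java.io.IOException;
import java.util.StringTokenizer;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class WordCount {

  // Mapper: emits (word, 1) for every token in the input split it is given.
  public static class TokenizerMapper
      extends Mapper<Object, Text, Text, IntWritable> {
    private final static IntWritable one = new IntWritable(1);
    private Text word = new Text();

    public void map(Object key, Text value, Context context)
        throws IOException, InterruptedException {
      StringTokenizer itr = new StringTokenizer(value.toString());
      while (itr.hasMoreTokens()) {
        word.set(itr.nextToken());
        context.write(word, one);
      }
    }
  }

  // Reducer: sums the counts collected for each word across all mappers.
  public static class IntSumReducer
      extends Reducer<Text, IntWritable, Text, IntWritable> {
    private IntWritable result = new IntWritable();

    public void reduce(Text key, Iterable<IntWritable> values, Context context)
        throws IOException, InterruptedException {
      int sum = 0;
      for (IntWritable val : values) {
        sum += val.get();
      }
      result.set(sum);
      context.write(key, result);
    }
  }

  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();
    Job job = Job.getInstance(conf, "word count");
    job.setJarByClass(WordCount.class);
    job.setMapperClass(TokenizerMapper.class);
    job.setCombinerClass(IntSumReducer.class); // pre-aggregates locally on each node
    job.setReducerClass(IntSumReducer.class);
    job.setOutputKeyClass(Text.class);
    job.setOutputValueClass(IntWritable.class);
    FileInputFormat.addInputPath(job, new Path(args[0]));   // placeholder input path
    FileOutputFormat.setOutputPath(job, new Path(args[1])); // placeholder output path
    System.exit(job.waitForCompletion(true) ? 0 : 1);
  }
}

Because the map code is shipped to the nodes that already hold the data, the expensive step of moving raw data to a central analytics engine is avoided, which is the bottleneck reduction described in the paragraph above.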

TWO HADOOP STRATEGIES

Nucleus found that companies using Hadoop typically took one of two approaches when planning their Hadoop deployments:

- Customers that chose to scale Hadoop with their data storage were using Hadoop over a unified database and extending it to reach all of the data nodes and clusters as the amount of data storage space increased. These customers planned early around larger initial data sets that required a farther-reaching solution for data management and increased accessibility.

- Customers that chose to scale Hadoop separately from their database expansions had implemented individual instances of Hadoop over smaller database clusters within their overall data storage system. As a result, not all data storage expansions pertained to the data sets overlaid with Hadoop; when they did, the existing instance of Hadoop could be extended to the additional nodes or clusters with smaller scalability requirements.

COST-BENEFIT TRADEOFFS

Nucleus found that while customers taking the single unified approach made a larger initial investment, they achieved benefits over time as they grew their data sets and integrated more solution deployments. They relied more heavily on Hadoop for data visibility, understanding, and accessibility from a centralized location. These customers realized benefits including:

- Reduced costs and staff. Customers scaling Hadoop with their data storage needed fewer staff to meet growing analytics needs because they were managing only one instance of the platform. As a result, these customers were able to reduce the cost of scaling Hadoop as they expanded their data storage capacity.

- Reduced cost of integrations. Managing one instance across a large and diverse data set also eliminated interfacing between instances of the platform when integrating disparate solutions and databases. As a result, these customers required smaller teams for deployment, maintenance, and coding than those that scaled Hadoop separately.

- Increased analytics visibility. These customers also increased their visibility into data visualization and analytics metrics, reducing the time needed to measure data transactions and to track data sourcing.

- Increased predictability. Scaling Hadoop together with data storage allowed customers to plan the timing of platform expansion as they planned to handle more data. As a result, these customers had greater insight into when they would need to service the system, reducing the impact of maintenance and expansion.

Customers said:

- "We have huge volumes of data and we have had a difficult time dealing with it. Our ecosystem is made up of Hadoop and several other elements, with one instance of Hadoop distributed over the data storage. Users no longer go into the big data environment, with IT and business analysts exploring the data in Hadoop. Once the nuggets of information are identified, the nuggets are then pushed to our analytics platform, and the entire organization is amazed with the insight and data they've been able to see."

- "While the staff members that support our single instance of Hadoop have very unique skills, in that they have to know Hadoop and the business data, they've established a business analysis competency center where all the business units are represented, so expectations and processes can be accounted for and set up front and there are no complaints about performance."

Customers that scaled Hadoop separately from their database expansions experienced short-term savings relative to the unified approach, but they incurred long-term costs not realized by scaling Hadoop together with their data storage. As one customer noted, "Big data was so new to the business that it was hard to get them to see the value at first. They basically felt that if you don't know what you are missing, then you don't know the questions to ask. As a result, we deployed Hadoop only where it made sense, realizing value on a smaller level than we could have had we deployed a scalable version over all the data. Now we can't scale it with all of our data, only what we have pigeonholed as relevant at that point in time."

Managing Hadoop over smaller databases allowed companies to optimize their resource deployment by localizing expansion rather than expanding Hadoop every time the count of data nodes or database clusters increased. As a result, these companies realized lower labor costs in the short term but higher labor costs over the long term as they amassed larger amounts of more complex data. Similarly, as these companies grew, they needed more resources to facilitate solution integrations to connect the databases and to consolidate the instances of Hadoop. While Hadoop maintained the same levels of data accessibility locally as it did globally, scaling separately required more labor time to reconcile discrepancies across disparate data sets and Hadoop configurations.

One customer found, "Our ecosystem uses Hadoop but not at an all-encompassing level. As a result, each data team manages their own instance, meaning we are still integrating across instances, data, and analytics rather than just across the analytics. It takes more time and is no doubt costing us more where one instance could deliver the same results."

CONCLUSION

As companies look to effectively manage their growing data sets, smart companies are looking at long-term requirements to strategize how to cost-effectively manage analytics relative to data management. Nucleus found that Hadoop offers customers a strategic platform for data management that reduces the labor required to run data processes and to integrate solutions and data sets. Nucleus also found that the customers achieving the most benefits from their Hadoop deployments were engaging one instance of the platform extended to all of their data sets. These customers either began with a large database that had surpassed their ability to manage the data internally, or had planned their deployment around reaching that point. In either case, these customers reduced the labor and cost of managing Hadoop as they expanded their data sets, while avoiding the solution and integration costs of overlaying additional instances of Hadoop.

© 2014 Nucleus Research, Inc. Reproduction in whole or in part without written permission is prohibited. Nucleus Research is the leading provider of value-focused technology research and advice. NucleusResearch.com
