
Big Data Processing Stacks

Sherif Sakr, King Saud bin Abdulaziz University for Health Sciences, Saudi Arabia

The past few years have seen the rise of big data processing stacks enhanced with domain-specific, optimized, and vertically focused features. The author analyzes these processing stacks’ capabilities and describes ongoing developments in this domain.

The radical expansion and integration of computation, networking, digital devices, and data storage has generated large amounts of data that must be processed, shared, and analyzed. For example, Facebook generates more than 10 petabytes of log data monthly, and Google processes hundreds of petabytes per month. Alibaba generates tens of terabytes in daily online trading transactions. This collected information is growing, and the explosive increase of global data in all 3Vs (volume, velocity, and variety) has been termed big data. According to IBM, we are currently creating 2.5 quintillion bytes of data every day (https://www-01.ibm.com/software/data/bigdata/what-is-big-data.html). IDC predicts that the worldwide volume of data will reach 40 zettabytes by 2020, 85 percent of which will be unstructured data in new types and formats—including server logs and other machine-generated data, data from sensors, social media data, and many other sources (www.emc.com/about/news/press/2012/20121211-01.htm). In practice, these conditions represent a new scale of big data that has been attracting a lot of interest from both the research and industrial communities, which hope to create the best means of processing, analyzing, and using this data. In principle, data is a key resource in the modern world. However, it is not useful in and of itself.


Data has utility only if meaning and value can be extracted from it. Therefore, continuous, increasing efforts are devoted to producing and analyzing it to extract this value. In principle, big data discovery enables data scientists and other analysts to uncover patterns and correlations by analyzing large volumes of diverse data. Insights gleaned from big data discovery can provide businesses with significant competitive advantages, such as more successful marketing campaigns, decreased customer churn, and reduced loss from fraud. Therefore, it is crucial that these large, emerging data types be harnessed to provide a more complete picture of what is happening in various application domains. Consequently, the increasing demand for large-scale data processing and data analysis applications has triggered the development of novel solutions from industry and academia. In the current era, data represent the new gold, with analytics systems representing the machinery that analyzes, mines, models, and mints it.

For roughly a decade, the Hadoop framework has been the de facto standard of big data technologies; it has been widely used as a popular mechanism for harnessing the power of large computer clusters. However, with the increasing demands and requirements of various big data processing applications (big graphs, big streams, big SQL, big machine learning, and so on), both the research and industrial communities have recognized various limitations in the Hadoop framework.1,2 It has become apparent that Hadoop's original design and implementation cannot serve as a one-size-fits-all solution for all big data processing problems. Therefore, the Hadoop big data processing stack has been equipped with various extensions to deal with new demands. In addition, new big data processing stacks, such as Spark and Flink, have been introduced to address various limitations of the Hadoop framework.

Hadoop Stack

The Hadoop project was introduced as an open source Java library that supports data-intensive distributed applications and clones the implementation of Google's MapReduce framework.3 In principle, Hadoop consists of two main components: the Hadoop Distributed File System (HDFS) and the MapReduce programming model (see Figure 1).

Figure 1. Hadoop's ecosystem: components such as Hive, Impala, Tajo, Mahout, Giraph, and Hama run on top of MapReduce and HDFS. Hadoop allows large computations to be easily parallelized and enables the implementation of a simple and elegant fault-tolerance mechanism, but its design is not adequate for supporting real-time processing of large-scale streaming data.

Specifically, HDFS provides the basis for distributed big data storage, which distributes data files into data blocks and stores them in different nodes of the underlying computing cluster to enable parallel data processing.

The MapReduce programming model is a simple but powerful model that enables the easy development of scalable parallel applications to process vast amounts of data on large computing clusters.3 In particular, it isolates the application developer from the sophisticated details of running a distributed program, including issues of data distribution, scheduling, and fault tolerance. In this model, the computation takes a set of key-value pairs as input and produces a set of key-value pairs as output. MapReduce users can express the computation using two functions: Map and Reduce. The Map function takes an input pair and produces a set of intermediate key-value pairs. The MapReduce programming model groups together all intermediate values associated with the same intermediate key I and passes them to the Reduce function, which receives an intermediate key I with its set of values and merges them together. Typically, zero or one output value is produced per Reduce invocation. This model's main advantage is that it allows large computations to be easily parallelized, with re-execution used as the primary mechanism for fault tolerance.
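To make the model concrete, the following is a minimal word-count sketch against Hadoop's Java MapReduce API. The class name and input/output paths are illustrative, and job configuration is reduced to the essentials.

```java
import java.io.IOException;
import java.util.StringTokenizer;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class WordCount {

  // Map: emit an intermediate (word, 1) pair for every word in the input value.
  public static class TokenizerMapper extends Mapper<Object, Text, Text, IntWritable> {
    private static final IntWritable ONE = new IntWritable(1);
    private final Text word = new Text();

    @Override
    protected void map(Object key, Text value, Context context)
        throws IOException, InterruptedException {
      StringTokenizer itr = new StringTokenizer(value.toString());
      while (itr.hasMoreTokens()) {
        word.set(itr.nextToken());
        context.write(word, ONE);
      }
    }
  }

  // Reduce: all values grouped under the same intermediate key are merged into one count.
  public static class IntSumReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
    @Override
    protected void reduce(Text key, Iterable<IntWritable> values, Context context)
        throws IOException, InterruptedException {
      int sum = 0;
      for (IntWritable v : values) {
        sum += v.get();
      }
      context.write(key, new IntWritable(sum));
    }
  }

  public static void main(String[] args) throws Exception {
    Job job = Job.getInstance(new Configuration(), "word count");
    job.setJarByClass(WordCount.class);
    job.setMapperClass(TokenizerMapper.class);
    job.setReducerClass(IntSumReducer.class);
    job.setOutputKeyClass(Text.class);
    job.setOutputValueClass(IntWritable.class);
    FileInputFormat.addInputPath(job, new Path(args[0]));   // HDFS input directory
    FileOutputFormat.setOutputPath(job, new Path(args[1])); // HDFS output directory
    System.exit(job.waitForCompletion(true) ? 0 : 1);
  }
}
```

Note how the developer writes only the two first-order functions; data distribution, scheduling, and failure handling are left entirely to the framework.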

Therefore, Hadoop has been widely used in various big data applications, including clickstream analysis, spam filtering, network search, and social recommendations. Hadoop has been used by several big companies, such as Yahoo and Facebook, and many companies support Hadoop commercial execution environments, including IBM, Oracle, EMC, Cloudera, MapR, and Hortonworks.

Despite the Hadoop framework's widespread success, it suffers from various limitations.

For example, the oversimplicity of the MapReduce programming model—which relies on a rigid, one-input, two-stage dataflow—requires that users devise inelegant workarounds to perform tasks that have a different dataflow (such as joins or n stages). In addition, many programmers might prefer to use other abstract and declarative languages (in which they are more proficient), such as SQL, to express their tasks, while leaving all the execution optimization details to the backend engine. Moreover, several studies have reported that Hadoop is the wrong choice for interactive queries that have a target response time of a few seconds or milliseconds.4 In particular, Hadoop has proven to be inefficient at processing large-scale structured data, a domain in which traditional parallel database systems have been doing much better. Therefore, the Hadoop stack has been enhanced by several components that are designed to tackle these challenges.

For example, Hive has been introduced to support SQL on Hadoop with familiar relational database concepts such as tables, columns, and partitions.5 It supports queries that are expressed in a SQL-like declarative language, the Hive Query Language (HiveQL), which represents a subset of SQL-92 and can thus be easily understood by anyone familiar with SQL. These queries automatically compile into Hadoop jobs.
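For illustration, a HiveQL query can be submitted from a Java client through Hive's standard JDBC interface. The sketch below is not from the article; the HiveServer2 endpoint, credentials, and the clicks table are hypothetical.

```java
import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.ResultSet;
import java.sql.Statement;

public class HiveQueryExample {
  public static void main(String[] args) throws Exception {
    // HiveServer2 JDBC endpoint; host, port, database, and credentials are assumptions.
    String url = "jdbc:hive2://localhost:10000/default";
    try (Connection conn = DriverManager.getConnection(url, "hive", "");
         Statement stmt = conn.createStatement()) {
      // HiveQL looks like SQL but is compiled into Hadoop jobs behind the scenes.
      ResultSet rs = stmt.executeQuery(
          "SELECT page, COUNT(*) AS visits "
          + "FROM clicks WHERE dt = '2017-01-01' "
          + "GROUP BY page ORDER BY visits DESC LIMIT 10");
      while (rs.next()) {
        System.out.println(rs.getString("page") + "\t" + rs.getLong("visits"));
      }
    }
  }
}
```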

Impala (http://impala.io) is another open source project, built by Cloudera, to provide a massively parallel processing SQL query engine that runs natively in Apache Hadoop. It utilizes the standard components of Hadoop's infrastructure (such as HDFS, HBase, and Yet Another Resource Negotiator, or YARN) and can read the majority of widely used file formats (Parquet, Avro, and so on). Through Impala, users can query data that is stored in HDFS. The IBM big data processing platform, InfoSphere BigInsights, which is built on the Apache Hadoop framework, provides a Big SQL engine as its SQL interface. It provides SQL access to data that is stored in InfoSphere BigInsights and uses the Hadoop framework for complex datasets and direct access for smaller queries. Apache Tajo (http://tajo.apache.org) is another distributed data warehouse system for Apache Hadoop that is designed for low-latency, scalable ad hoc queries and ETL (extract, transform, load) processes. Tajo can analyze data stored on HDFS, Amazon Simple Storage Service (S3), OpenStack Swift, and local file systems. It provides an extensible query rewrite system that lets users and external programs query data through SQL.

With the enormous growth in the sizes of graph datasets, the demand for scalable graph processing platforms has been increasing. For instance, Facebook has reported that its social network graph contains more than a billion users (nodes) and more than 140 billion friendship relationships (edges). In practice, large-scale graph processing requires huge amounts of computational power, because graph processing algorithms are iterative and must traverse the graph in a particular way. Graph algorithms can be implemented as a sequence of Hadoop invocations that passes the entire state of the graph from one step to the next. However, this mechanism is not adequate for graph processing and leads to inefficient performance because of the associated serialization and communication overhead.

Apache Giraph has been introduced as an open source project that supports large-scale graph processing and clones the implementation of Google's Pregel system.6 Giraph provides an implementation of the Bulk Synchronous Parallel (BSP) programming model, with a native API specifically for programming graph algorithms using a "think like a vertex" computing paradigm. BSP is a parallel programming model that uses message passing to address the challenge of parallelizing jobs across multiple nodes; the computation on vertices is represented as a sequence of supersteps, with synchronization between the participating nodes at superstep barriers. Each vertex can be active or inactive at each iteration (superstep). Giraph was initially implemented by Yahoo, and Facebook has since developed its Graph Search facilities using Giraph. Giraph runs graph processing jobs as map-only jobs on Hadoop and uses HDFS for data input and output. Like Giraph, Apache Hama (https://hama.apache.org) is another BSP-based implementation designed to run on top of the Hadoop infrastructure. However, it focuses on general BSP computations, not just graph processing; for example, it includes algorithms for matrix inversion and linear algebra.
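To give a flavor of the "think like a vertex" style, the sketch below loosely follows Giraph's BasicComputation API to propagate the maximum value through a graph. The class name is hypothetical, exact signatures vary across Giraph versions, and input formats and job configuration are omitted.

```java
import java.io.IOException;

import org.apache.giraph.graph.BasicComputation;
import org.apache.giraph.graph.Vertex;
import org.apache.hadoop.io.DoubleWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.NullWritable;

// Each vertex keeps the largest value it has seen and forwards improvements to its neighbors.
public class MaxValueComputation
    extends BasicComputation<LongWritable, DoubleWritable, NullWritable, DoubleWritable> {

  @Override
  public void compute(Vertex<LongWritable, DoubleWritable, NullWritable> vertex,
                      Iterable<DoubleWritable> messages) throws IOException {
    double max = vertex.getValue().get();
    for (DoubleWritable message : messages) {
      max = Math.max(max, message.get());
    }
    // On the first superstep, or whenever the value improves, notify all neighbors.
    if (getSuperstep() == 0 || max > vertex.getValue().get()) {
      vertex.setValue(new DoubleWritable(max));
      sendMessageToAllEdges(vertex, new DoubleWritable(max));
    }
    // A vertex with nothing new to report votes to halt; new messages reactivate it.
    vertex.voteToHalt();
  }
}
```

The computation terminates when every vertex has voted to halt and no messages remain in flight, which is exactly the superstep-barrier semantics described above.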

Figure 2. (a) Spark framework vs. (b) Hadoop framework. A Spark job keeps the intermediate results of each step in RAM (memory), whereas a Hadoop job writes the output of each step to HDFS (hard disk). Spark takes the concepts of Hadoop to the next level by loading the data in distributed memory and relying on less expensive shuffles during data processing.

Machine learning algorithms represent another type of application that is iterative in nature. The Apache Mahout project has been designed to build scalable machine learning libraries using the Hadoop framework.

Several applications and domains—including financial markets, surveillance systems, manufacturing, smart cities, and scalable monitoring infrastructure—share a crucial requirement to collect, process, and analyze big streams of data to extract valuable information, discover new insights in real time, and detect emerging patterns and outliers. For example, the world's largest stock exchange, the New York Stock Exchange (NYSE), reported trading more than 800 million shares on a typical day in October 2012. As another example, IBM reported that, by the end of 2011, about 30 billion RFID tags were in circulation, each of which represented a potential data generator (www.dotgroup.co.uk/wp-content/uploads/2014/11/Harness-the-Power-of-Big-Data-The-IBM-Big-Data-Platform.pdf). Hadoop's design is not adequate for supporting real-time processing of such large-scale streaming data.

Spark Stack

The Spark project (http://spark.apache.org) has been introduced as a general-purpose big data processing engine that can be used for many types of data processing scenarios.

Spark, written in Scala, was originally developed in the AMPLab at the University of California, Berkeley. It was made open source in 2010 as one of a new generation of dataflow engines following the line of the Hadoop framework. In particular, Hadoop introduced a radical new approach based on distributing data when it is stored and running computation where that data is. However, one of its main limitations is that it requires the entire output of each map and reduce task to be materialized into a local file on HDFS before it can be consumed by the next stage. This materialization step allows for the implementation of a simple and elegant checkpoint/restart fault-tolerance mechanism, but it dramatically harms system performance. Spark takes the concepts of Hadoop to the next level by loading the data in distributed memory and relying on less expensive shuffles during data processing. Figure 2 illustrates the main architectural differences between the Spark and Hadoop frameworks.

In Spark, a function represents the fundamental unit of programming; its fundamental data abstraction is called a resilient distributed dataset (RDD). These RDDs represent a logical collection of data partitioned across machines that is created by referencing datasets in external storage systems or by applying coarse-grained transformations (such as filter, map, reduce, or join) to existing RDDs. An RDD is an in-memory data structure, which gives power to Spark's functional programming paradigm by enabling user-defined jobs to load data into a cluster's memory and query it repeatedly. The resilience of RDDs ensures that if data in memory gets lost, it can be recreated using the available metadata information. In addition, users can explicitly cache an RDD in memory across machines and reuse it in multiple parallel operations. Specifically, RDDs can be manipulated through operations such as filter, map, and reduce, which take functions in the programming language and ship them to nodes on the cluster. This simplifies programming complexity because the way applications manipulate RDDs is similar to how local data collections are manipulated.
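A minimal sketch of this style using Spark's Java API is shown below; the input path and log format are hypothetical. Transformations such as filter and mapToPair only record lineage, the explicit cache call pins the filtered RDD in cluster memory, and actions such as count trigger the actual distributed computation.

```java
import org.apache.spark.SparkConf;
import org.apache.spark.api.java.JavaPairRDD;
import org.apache.spark.api.java.JavaRDD;
import org.apache.spark.api.java.JavaSparkContext;

import scala.Tuple2;

public class LogAnalysis {
  public static void main(String[] args) {
    SparkConf conf = new SparkConf().setAppName("log-analysis");
    JavaSparkContext sc = new JavaSparkContext(conf);

    // Transformations are lazy: nothing is read or computed yet.
    JavaRDD<String> lines = sc.textFile("hdfs:///logs/app/*.log");
    JavaRDD<String> errors = lines.filter(line -> line.contains("ERROR"));

    // Pin the filtered RDD in distributed memory so several actions can reuse it.
    errors.cache();

    // Actions trigger the distributed computation over the cached data.
    long totalErrors = errors.count();

    JavaPairRDD<String, Integer> byComponent = errors
        .mapToPair(line -> new Tuple2<>(line.split(" ")[0], 1)) // assume the first token names the component
        .reduceByKey((a, b) -> a + b);

    System.out.println("errors: " + totalErrors
        + ", components: " + byComponent.count());
    sc.stop();
  }
}
```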


Figure 3. Spark's ecosystem. In addition to its core API, Spark supports several libraries (SparkSQL, Spark Streaming, MLlib, GraphX, and SparkR) that provide additional functionalities for big data processing. These run on the Spark core engine, which executes on cluster managers such as Apache YARN and Apache Mesos over storage systems such as HDFS, Cassandra, Amazon S3, and other sources.

In addition, Spark relies on a lazy execution model for RDDs: RDD data is not processed and materialized until an action is performed. For cluster management, Spark supports both Apache Mesos and Hadoop YARN. Spark can interface with a wide variety of distributed storage implementations, including HDFS, the Cassandra NoSQL database, and Amazon S3, and it provides APIs for various programming languages, including Scala, Java, Python, and R. During the 2014 annual Daytona Gray Sort Challenge (http://sortbenchmark.org), which benchmarks the speed of data analysis systems, Spark strongly outperformed Hadoop and was able to sort through 100 terabytes of records in 23 minutes, whereas Hadoop took more than three times as long (approximately 72 minutes) to execute the same task. Currently, Spark has more than 500 contributors from more than 200 organizations, making it the most active project both in the Apache Software Foundation and among big data open source projects in general. Popular distributors of the Hadoop ecosystem (Cloudera, Hortonworks, and MapR, for example) are currently including Spark in their releases.

Figure 3 shows an overview of the stack for the Spark big data processing framework. In addition to the Spark core API, various libraries are part of the Spark ecosystem and provide additional functionalities for big data processing. In particular, Spark provides various packages with higher-level libraries, including support for SQL queries,7 streaming data, machine learning,8 statistical programming, and graph processing.9 These libraries increase developer productivity and can be seamlessly combined to create complex workflows. For example, SparkSQL integrates relational processing with Spark's functional programming API.7 It bridges the gap between the two models by providing a DataFrames API that can execute relational operations on both external data sources and Spark's built-in distributed collections.7 SparkSQL relies on an extensible optimizer, called Catalyst, that supports adding data sources, optimization rules, and data types for domains such as machine learning.

GraphX is a distributed graph engine built on top of the Spark framework.9 GraphX extends Spark's RDD abstraction to introduce the resilient distributed graph (RDG), which associates records with vertices and edges in a graph and provides a collection of expressive computational primitives. The GraphX RDG leverages advances in distributed graph representation and exploits the graph structure to minimize communication and storage overhead.

Spark is also equipped with an extension API that adds support for continuous stream processing. In particular, Spark Streaming relies on a micro-batch processing mechanism, which collects all data that arrives within a certain time period and runs a regular batch program on the collected data. While the batch program is running, the data for the next mini batch is collected. Therefore, Spark Streaming can be considered a batch-processing mechanism with a controlled time window for stream processing.
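The following sketch shows what this micro-batch style looks like with the Spark 2.x Java streaming API; the socket source and the one-second batch interval are illustrative assumptions. Each batch interval yields a small RDD on which an ordinary batch-style word count runs.

```java
import java.util.Arrays;

import org.apache.spark.SparkConf;
import org.apache.spark.streaming.Durations;
import org.apache.spark.streaming.api.java.JavaDStream;
import org.apache.spark.streaming.api.java.JavaPairDStream;
import org.apache.spark.streaming.api.java.JavaReceiverInputDStream;
import org.apache.spark.streaming.api.java.JavaStreamingContext;

import scala.Tuple2;

public class StreamingWordCount {
  public static void main(String[] args) throws InterruptedException {
    SparkConf conf = new SparkConf().setAppName("streaming-word-count");
    // Group the incoming stream into one-second micro batches.
    JavaStreamingContext ssc = new JavaStreamingContext(conf, Durations.seconds(1));

    // Text lines arriving on a (hypothetical) local socket.
    JavaReceiverInputDStream<String> lines = ssc.socketTextStream("localhost", 9999);

    // A regular batch-style word count is applied to every micro batch.
    JavaDStream<String> words =
        lines.flatMap(line -> Arrays.asList(line.split(" ")).iterator());
    JavaPairDStream<String, Integer> counts = words
        .mapToPair(word -> new Tuple2<>(word, 1))
        .reduceByKey((a, b) -> a + b);
    counts.print();

    ssc.start();            // start collecting and processing micro batches
    ssc.awaitTermination(); // run until the job is stopped externally
  }
}
```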
Flink Stack

Apache Flink (https://flink.apache.org) is another distributed in-memory data processing framework. It represents a flexible alternative to the Hadoop framework that supports both batch and real-time processing. Instead of Hadoop's map and reduce abstractions, Flink uses a directed graph approach that leverages in-memory storage to improve runtime execution performance.

Table 1. Feature summary of Hadoop, Spark, and Flink.

Feature            | Hadoop                                        | Spark                                  | Flink
-------------------|-----------------------------------------------|----------------------------------------|--------------------------------
Year of origin     | 2005                                          | 2009                                   | 2009
Place of origin    | MapReduce (Google), Hadoop (Yahoo)            | University of California, Berkeley     | Technical University of Berlin
Programming model  | Map and reduce functions over key-value pairs | Resilient distributed datasets (RDDs)  | Parallelization contracts (PACTs)
Data storage       | Hadoop Distributed File System (HDFS)         | HDFS, Cassandra, and others            | HDFS, Amazon Simple Storage Service, and others
Execution engine   | Yet Another Resource Negotiator (YARN)        | YARN and Mesos                         | Nephele
SQL support        | Hive, Impala, and Tajo                        | SparkSQL                               | N/A
Graph support      | N/A                                           | GraphX                                 | Gelly
Streaming support  | N/A                                           | Spark Streaming                        | Flink Streaming

Flink can run as a completely independent framework or on top of HDFS and YARN. It originated from the Stratosphere research project, which began at the Technical University of Berlin in 2009, before joining Apache's incubator in 2014.10 Recently, Flink became a top-level project of the open source Apache Software Foundation.

In principle, Stratosphere uses a richer set of primitives than Hadoop, including ones that allow the easy specification, automatic optimization, and efficient execution of joins. It treats user-defined functions (UDFs) as first-class citizens and relies on a query optimizer that automatically parallelizes and optimizes big data processing jobs. Stratosphere offers both pipeline (interoperator) and data (intraoperator) parallelism. In particular, Stratosphere relies on the parallelization contracts (PACTs) programming model,10 which represents a generalization of map/reduce based on a key-value data model and the concept of PACTs. A PACT consists of exactly one second-order function, called an input contract, and an optional output contract. An input contract takes a first-order function with task-specific user code and one or more datasets as input parameters, and it invokes its associated first-order function with independent subsets of its input data in a data-parallel fashion.

Figure 4 shows an overview of the layers of the Flink big data processing framework. The Flink system is equipped with the Flink Streaming API, an extension of the core Flink API for high-throughput and low-latency datastream processing. The system can connect to and process datastreams from various data sources (such as Flume or ZeroMQ), and datastreams can be transformed and modified using high-level functions similar to those provided by the batch-processing API.
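The sketch below conveys the flavor of the Flink Streaming API in Java, counting words over five-second windows. The socket source and window size are assumptions, and method names such as keyBy and timeWindow reflect the Flink 1.x API and can differ across versions.

```java
import org.apache.flink.api.common.functions.FlatMapFunction;
import org.apache.flink.api.java.tuple.Tuple2;
import org.apache.flink.streaming.api.datastream.DataStream;
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;
import org.apache.flink.streaming.api.windowing.time.Time;
import org.apache.flink.util.Collector;

public class FlinkStreamingWordCount {
  public static void main(String[] args) throws Exception {
    StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();

    // Text lines arriving on a (hypothetical) local socket.
    DataStream<String> lines = env.socketTextStream("localhost", 9999);

    DataStream<Tuple2<String, Integer>> counts = lines
        // Split each line into (word, 1) pairs; the UDF is shipped to the workers.
        .flatMap(new FlatMapFunction<String, Tuple2<String, Integer>>() {
          @Override
          public void flatMap(String line, Collector<Tuple2<String, Integer>> out) {
            for (String word : line.split(" ")) {
              out.collect(new Tuple2<>(word, 1));
            }
          }
        })
        // Group by the word field and sum the counts over five-second windows.
        .keyBy(0)
        .timeWindow(Time.seconds(5))
        .sum(1);

    counts.print();
    env.execute("streaming word count");
  }
}
```

Unlike Spark's micro batches, Flink processes records as a continuous stream, with windows defined on top of that stream.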

Figure 4. Flink layers. Components such as Meteor, Flink Gelly, and Flink ML sit on top of the PACT programming layer and the Nephele execution engine, which can run on Apache YARN or Amazon EC2 over storage such as HDFS, Amazon S3, and other sources. The Flink system is equipped with the Flink Streaming API, an extension of the core Flink API for high-throughput and low-latency datastream processing.

In addition, the Flink open source community recently developed libraries for machine learning (FlinkML) and graph processing (Gelly). Table 1 summarizes the features of the Hadoop, Spark, and Flink stacks.

Pipelining Frameworks

One of the most common scenarios in big data processing is that users need to break down the barriers between data silos so that they can design computations or analytics that combine different types of data (structured, unstructured, stream, graph, and so on) or jobs. To tackle this challenge, several frameworks have been introduced to build pipelines of big data processing jobs. Table 2 summarizes the features of these frameworks.


Table 2. Feature summary of pipelining frameworks.

Feature               | Tez    | MRQL                           | Cascading                            | Crunch
----------------------|--------|--------------------------------|--------------------------------------|---------------------------
Execution engine      | Hadoop | Hadoop, Spark, Flink, and Hama | Hadoop                               | Hadoop and Spark
Pipeline definition   | Edges  | SQL                            | Operators                            | Operators
Pipeline connection   | No     | SQL                            | Source and target                    | Branches, joins, and mapTo
Programming languages | Java   | SQL                            | Java virtual machine-based languages | Java and Scala

Apache Tez (https://tez.apache.org) is a generalized data processing framework. Tez allows building dataflow-driven processing runtimes by specifying a complex directed acyclic graph (DAG) of tasks for high-performance batch and interactive data processing applications. In Tez, data processing is represented as a graph in which the vertices represent the processing steps and the edges represent the movement of data between the processing elements. Tez uses an event-based model to communicate between tasks, the system, and various components. These events are used to pass information, such as task failures, to the required components, whereby the dataflow moves from output to input. In Tez, the output of a Hadoop job can be directly pipelined to the next Hadoop job without requiring the user to write the intermediate results into HDFS; in case of any failure, the tasks from the last checkpoint are re-executed. In general, Tez is designed for frameworks such as Hive and Pig, not for developers to directly write application code for execution. In particular, when Tez is used with Pig and Hive, a single Pig Latin or HiveQL script is converted into a single Tez job rather than into a DAG of Hadoop jobs. However, executing a DAG of Hadoop jobs on Tez can be more efficient than executing it on Hadoop, because Tez applies dynamic performance optimization mechanisms that use real information about the data and the resources required to process it.

Apache MRQL (https://mrql.incubator.apache.org) is another query processing and optimization framework for distributed and large-scale data analysis; it is built on top of Apache Hadoop, Spark, Hama, and Flink. MRQL provides a SQL-like query language that can be evaluated in four independent modes:


MapReduce mode using Apache Hadoop, Spark mode using Apache Spark, BSP mode using Apache Hama, and Flink mode using Apache Flink. However, further research and development is still required to tackle this important challenge and facilitate the job of users.

Apache Crunch (https://crunch.apache.org) is a Java library for implementing pipelines that are composed of many user-defined functions and can be executed on top of the Hadoop and Spark engines. Apache Crunch is based on Google's FlumeJava library11 and is efficient for implementing common tasks such as joining data, performing aggregations, and sorting records. Cascading (www.cascading.org) is another software abstraction layer for the Hadoop framework that is used to create and execute data processing workflows on a Hadoop cluster using any Java virtual machine-based language (Java, JRuby, Clojure, and so on); it hides the underlying complexity of the Hadoop framework.
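As a rough illustration of this style of pipeline API, the following Crunch-like word count tokenizes text and counts words on top of Hadoop. The class name and HDFS paths are hypothetical, and this is a sketch of the API rather than a complete application.

```java
import org.apache.crunch.DoFn;
import org.apache.crunch.Emitter;
import org.apache.crunch.PCollection;
import org.apache.crunch.PTable;
import org.apache.crunch.Pipeline;
import org.apache.crunch.impl.mr.MRPipeline;
import org.apache.crunch.types.writable.Writables;
import org.apache.hadoop.conf.Configuration;

public class CrunchWordCount {
  public static void main(String[] args) {
    // A pipeline whose stages are compiled into one or more Hadoop jobs.
    Pipeline pipeline = new MRPipeline(CrunchWordCount.class, new Configuration());

    PCollection<String> lines = pipeline.readTextFile("hdfs:///input/docs");

    // User-defined function applied in parallel to every input record.
    PCollection<String> words = lines.parallelDo(new DoFn<String, String>() {
      @Override
      public void process(String line, Emitter<String> emitter) {
        for (String word : line.split("\\s+")) {
          emitter.emit(word);
        }
      }
    }, Writables.strings());

    // count() groups identical words and tallies them.
    PTable<String, Long> counts = words.count();

    pipeline.writeTextFile(counts, "hdfs:///output/wordcounts");
    pipeline.done(); // triggers planning and execution of the underlying jobs
  }
}
```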

After in-depth research, a McKinsey report pointed out that big data has created value in the US healthcare, US retail, EU public sector administration, global personal location data, and global manufacturing markets.12 The report also shows that big data has played a primary role in economic functions and in improving the effectiveness and productivity of the public sector and enterprise, creating valuable benefits for consumers. For roughly a decade, the Hadoop framework has dominated the big data processing world. However, with increasing big data processing demands and requirements, we have recently witnessed the rise of competing processing stacks with various engines that are vertically focused and optimized for tackling specific problems and application domains. In addition, various pipelining frameworks have been introduced to enable the creation of workflows that can be executed over multiple engines. Therefore, these big data processing stacks can evolve from being competitors of or replacements for each other into augmentable or complementary engines that can efficiently deal with complex big data processing requirements. For example, recent Cloudera distributions support Hadoop with Spark in the same ecosystem. I believe that this direction will progressively continue to further support the wider vision of big data processing technologies.

Sherif Sakr is a professor in the Department of Health Informatics at King Saud bin Abdulaziz University for Health Sciences, Saudi Arabia, and is affiliated with the University of New South Wales, Australia, and Data61/CSIRO. His research interests include graph data management, big data storage/processing in cloud computing environments, and data science. Sakr received a PhD in computer and information science from Konstanz University, Germany. He is an IEEE senior member. Contact him at [email protected].


References

1. S. Sakr, A. Liu, and A.G. Fayoumi, "The Family of MapReduce and Large-Scale Data Processing Systems," ACM Computing Surveys, vol. 46, no. 1, 2013, article no. 11.
2. S. Sakr et al., "A Survey of Large Scale Data Management Approaches in Cloud Environments," IEEE Comm. Surveys & Tutorials, vol. 13, no. 3, 2011, pp. 311–336.
3. J. Dean and S. Ghemawat, "MapReduce: Simplified Data Processing on Large Clusters," Proc. 6th Conf. Symp. Operating Systems Design & Implementation (OSDI), 2004, pp. 10–10.
4. A. Pavlo et al., "A Comparison of Approaches to Large-Scale Data Analysis," Proc. 2009 ACM SIGMOD Int'l Conf. Management of Data, 2009, pp. 165–178.
5. Y. Huai et al., "Major Technical Advancements in Apache Hive," Proc. 2014 ACM SIGMOD Int'l Conf. Management of Data, 2014, pp. 1235–1246.
6. G. Malewicz et al., "Pregel: A System for Large-Scale Graph Processing," Proc. 2010 ACM SIGMOD Int'l Conf. Management of Data, 2010, pp. 135–146.
7. M. Armbrust et al., "Spark SQL: Relational Data Processing in Spark," Proc. 2015 ACM SIGMOD Int'l Conf. Management of Data, 2015, pp. 1383–1394.


8. E.R. Sparks et al., "MLI: An API for Distributed Machine Learning," Proc. IEEE Int'l Conf. Data Mining, 2013; http://ieeexplore.ieee.org/document/6729619/.
9. J.E. Gonzalez et al., "GraphX: Graph Processing in a Distributed Dataflow Framework," Proc. 11th Usenix Symp. Operating Systems Design and Implementation (OSDI), 2014, pp. 599–613.
10. A. Alexandrov et al., "The Stratosphere Platform for Big Data Analytics," Int'l J. Very Large Databases, vol. 23, no. 6, 2014, pp. 939–964.
11. C. Chambers et al., "FlumeJava: Easy, Efficient Data-Parallel Pipelines," Proc. 31st ACM SIGPLAN Conf. Programming Language Design and Implementation, 2010, pp. 363–375.
12. J. Manyika et al., Big Data: The Next Frontier for Innovation, Competition, and Productivity, tech. report, McKinsey Global Inst., June 2011.

