2016 IEEE International Parallel and Distributed Processing Symposium Workshops

Open Source Initiatives and Frameworks Addressing Distributed Real-time Data Analytics

Sarwar Jahan Morshed
Linnaeus University, Sweden; Daffodil International University, Bangladesh
[email protected]

Juwel Rana
Telenor Group, Oslo, Norway; Linnaeus University, Växjö, Sweden
[email protected]

Marcelo Milrad
Department of Media Technology, Linnaeus University, Växjö, Sweden
[email protected]

Abstract—The continuous evolution of digital services is resulting in the generation of extremely large data sets that are created in almost real time. Exploring new opportunities for improving the quality of these digital services, as well as providing better personalized experiences to digital users, are two major challenges to be addressed. Different methods, tools, and techniques exist today to generate actionable insights from digital services data. Traditionally, big data problems are handled on historical data sets. However, there is a growing demand for real-time data analytics to offer new services to users and to provide proactive customer care, personalized ads, and emergency aid, just to give a few examples. Despite the fact that a few frameworks for real-time analytics exist, utilizing them to solve distributed real-time big data analytical problems still remains a challenge. Existing real-time data analytics (RTDA) frameworks do not cover all the features required for distributed computation in real time. Therefore, in this paper, we present a qualitative overview and analysis of some of the most widely used existing RTDA frameworks. Specifically, Apache Spark, Apache Flink, Apache Storm, and Apache Samza are covered and discussed in this paper.

Keywords—Real-time; data analytics; big data; streaming data; data analytics framework; distributed real-time data analysis.

I. INTRODUCTION

Big Data analysis in real time is becoming a very relevant and hot topic for improving traffic and public transportation, municipal and utility services, emergency services, and so on [2]. Traditional big data processing frameworks such as PostgreSQL, Hadoop, etc. are not designed to deal with real-time data that is generated by multiple services. Under these circumstances, an alternative approach called distributed stream data processing has been introduced to deal with this problem [17]. To address it, a number of real-time streaming data analytics frameworks have been developed. Most of them come from open-source communities; Apache Spark and Apache Flink are just two examples. These frameworks are utilized by several digital service providers and have lately gained a lot of attention from industrial actors. For example, the video streaming provider Netflix uses Spark as one of its core analytical frameworks. Because there are many different frameworks for real-time distributed computing, the development of distributed real-time streaming data applications and tools becomes a challenging issue, both for application developers and for data analysts.

978-1-5090-3682-0/16 $31.00 © 2016 IEEE
DOI 10.1109/IPDPSW.2016.152

Below we present some of the main challenges with existing distributed RTDA frameworks [9-12]:

• Splitting analytical problems among the nodes of an RTDA framework
• Handling failures of worker nodes during any stage of the data processing
• Handling straggler nodes (nodes that have not failed but are comparatively slower)
• Dealing with slow network connectivity during any stage of the data processing
• Writing scripts for distributed stream data processing for different platforms

In this paper, we present the results of our qualitative review of existing distributed RTDA frameworks. The findings of this paper provide new insights for selecting a proper distributed RTDA framework to solve specific real-time data analytics jobs. The deployment of different RTDA frameworks is not a trivial task, as it requires a significant amount of computational power, memory, and storage. For this reason, the aim of this paper is to provide a brief overview of existing RTDA approaches in order to help select the right framework for a given case.

The rest of the paper is organized as follows. Section II proposes qualitative metrics for evaluating distributed RTDA frameworks. Section III classifies the major features covered by each of the RTDA frameworks and proposes areas for improvement. A brief discussion and a comparison of the different RTDA frameworks is presented in Section IV. Section V presents our conclusions together with the scope of our future research on RTDA tools and frameworks.

II. EVALUATION CRITERIA FOR DISTRIBUTED REAL-TIME DATA ANALYTICS FRAMEWORKS

Different initiatives towards developing distributed RTDA frameworks have been identified, from both open-source and commercial data analytics product developers. From our investigation, we found that most of these frameworks are not self-sufficient to support all kinds of real-time analytics needs within research, industry, and the public sector. Besides, to our knowledge there is no standard set of criteria for evaluating RTDA frameworks, independent of whether they are distributed in nature or not. It is not even clearly identified

whether these frameworks support distributed real-time applications. Therefore, in this paper we propose a list of initial metrics that could be useful to measure different features of RTDA frameworks. These metrics are presented below and have been elaborated based on the work carried out in [6-9]:

Real-time vs. Near Real-time: Applications developed using distributed RTDA frameworks should be capable of processing real-time stream data at event-level granularity, since some streaming data are real-time and some are near real-time. Event-level granularity provides the actual real-time data processing.

Combining Ability with Historical Data: Real-time streaming data is always processed in one pass rather than in multiple passes. However, real-time data analytics often requires historical data, especially for real-time machine-learning-based approaches. Therefore, it is important for distributed real-time data analytics systems to have the ability to use historical data (a minimal Spark-based sketch of this pattern is given after this list of metrics).

Stream Processing: It is important for an RTDA framework to be a multi-pass streaming system instead of a one-pass¹ streaming system. MapReduce, for example, is great at one-pass computation but inefficient for multi-pass algorithms. Multi-pass computation is often required to deal with delayed incoming data, node failures, etc.

¹In the one-pass model, stream data is processed in a single pass, while in the multi-pass model it is processed through one or more intermediate steps.

Primitive Data Sharing: Efficient primitives for data sharing should be offered by the frameworks. How data moves between the nodes during data processing is an important issue to address. High data mobility is key to retaining low latency; blocking operations and passive processing elements can decrease data mobility.

Flexible Time Windowing: Support for windowing (i.e., clustering of stream data based on some function of time) is another evaluation criterion for distributed real-time frameworks. Since not all data are always necessary, windowing is needed for stream data analysis. Distributed RTDA frameworks should be capable of partitioning stream data as well as handling it in parallel.

API Support: The number of APIs is one of the measurement criteria for distributed real-time data analysis frameworks. These frameworks should offer a wide range of open APIs that allow software developers to rapidly create streaming-data-based applications.

Standard Library for Big Data: Big data applications do not have sufficient libraries of common algorithms. Distributed real-time data processing frameworks should support a wide range of standard algorithm libraries for processing real-time distributed Big Data.

Handling Stream Imperfections: The messages in a stream can be delayed, arrive out of order, or be duplicated or lost. Distributed streaming data analytics frameworks should therefore be capable of processing delayed and erroneous incoming data.

Contributing Community of Developers and Users: Most of the distributed RTDA frameworks are open source. If several large organizations are involved in the development of a framework, it matures faster and can support most of the distributed real-time features for streaming data.
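To make the "combining ability with historical data" criterion concrete, the following minimal sketch (not taken from the paper) joins each micro-batch of a live stream against a static, historical lookup data set using Apache Spark's DStream API. The socket source on localhost:9999 and the user-to-region lookup table are hypothetical placeholders.

```java
import java.util.Arrays;

import org.apache.spark.SparkConf;
import org.apache.spark.api.java.JavaPairRDD;
import org.apache.spark.api.java.JavaSparkContext;
import org.apache.spark.streaming.Durations;
import org.apache.spark.streaming.api.java.JavaPairDStream;
import org.apache.spark.streaming.api.java.JavaReceiverInputDStream;
import org.apache.spark.streaming.api.java.JavaStreamingContext;

import scala.Tuple2;

public class StreamHistoryJoinSketch {
    public static void main(String[] args) throws InterruptedException {
        SparkConf conf = new SparkConf().setAppName("stream-history-join").setMaster("local[2]");
        JavaStreamingContext ssc = new JavaStreamingContext(conf, Durations.seconds(2));
        JavaSparkContext sc = ssc.sparkContext();

        // Static, "historical" lookup data (hypothetical): user id -> home region.
        JavaPairRDD<String, String> history = sc.parallelizePairs(Arrays.asList(
                new Tuple2<>("alice", "north"),
                new Tuple2<>("bob", "south")));

        // Live stream of user ids arriving on a hypothetical socket source.
        JavaReceiverInputDStream<String> users = ssc.socketTextStream("localhost", 9999);

        // countByValue() turns each 2-second micro-batch into (user, count) pairs;
        // transformToPair() then joins that batch against the historical RDD.
        JavaPairDStream<String, Tuple2<Long, String>> enriched =
                users.countByValue().transformToPair(batch -> batch.join(history));

        enriched.print();
        ssc.start();
        ssc.awaitTermination();
    }
}
```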

III. EXISTING DISTRIBUTED RTDA FRAMEWORKS

There are a number of distributed real-time data analytics frameworks and tools for analyzing Big Data in real time. Some of these frameworks and their features are presented below.

A. Spark

Apache Spark [14] treats streaming as fast batch processing. Spark is suitable for stateful computations and for cases where exactly-once delivery is required, regardless of the higher latency.

Supported Real-time Distributed Features of Spark
• Spark provides a rich SQL interface (Spark SQL) and supports standard query languages (e.g., Hive). It also provides the DataFrame API, a domain-specific language (DSL) for querying structured data (see the sketch below).
• Spark's data source API ranks among the best APIs for streaming data source integration. Almost all common data sources, such as NoSQL databases, Parquet, ORC, etc., can be integrated easily with Spark.
• The ability to run machine learning (ML) efficiently is one of the popular features of Spark. Its in-memory caching and related implementation features make it practical to use ML algorithms in Spark-based distributed RTDA applications.
• Although Spark is implemented in Scala, it also offers APIs in Java, Python, and R, so developers have more language options for developing platform-independent distributed applications.
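As a brief illustration of the SQL interface and the DataFrame DSL mentioned above, the following hedged sketch (assuming Spark 2.x and a hypothetical Parquet file of events with columns "user" and "action") expresses the same aggregation once in SQL and once through the DSL.

```java
import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;
import org.apache.spark.sql.SparkSession;

public class SparkSqlDslSketch {
    public static void main(String[] args) {
        SparkSession spark = SparkSession.builder()
                .appName("spark-sql-dsl-sketch")
                .master("local[*]")
                .getOrCreate();

        // Hypothetical Parquet source with columns "user" and "action".
        Dataset<Row> events = spark.read().parquet("/tmp/events.parquet");
        events.createOrReplaceTempView("events");

        // The same aggregation expressed via Spark SQL ...
        Dataset<Row> bySql =
                spark.sql("SELECT user, COUNT(*) AS actions FROM events GROUP BY user");

        // ... and via the DataFrame DSL.
        Dataset<Row> byDsl = events.groupBy("user").count();

        bySql.show();
        byDsl.show();
        spark.stop();
    }
}
```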

Area of Improvement in Spark
• Data streaming can be classified as real-time or near real-time. Spark processes streams as mini-batches, which do not provide event-level processing; that is, Spark is suited to near real-time data processing applications.
• Windowing of the batches depends on processing time. Therefore, flexible windowing is inadequate in Spark, since it deals with mini-batches of data (see the sketch below).
• Although machine learning algorithms form cyclic data flows, they are expressed as directed acyclic graphs within Spark. Distributed processing systems or tools generally do not encourage cyclic data flows because of the complexity of reasoning about them.
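The mini-batch model and its processing-time windows can be seen in the following minimal Spark Streaming sketch (a socket source on localhost:9999 is assumed): windows must be multiples of the 2-second batch interval and are aligned to processing time, not to event timestamps.

```java
import org.apache.spark.SparkConf;
import org.apache.spark.streaming.Durations;
import org.apache.spark.streaming.api.java.JavaDStream;
import org.apache.spark.streaming.api.java.JavaStreamingContext;

public class SparkWindowSketch {
    public static void main(String[] args) throws InterruptedException {
        SparkConf conf = new SparkConf().setAppName("spark-window-sketch").setMaster("local[2]");

        // Every micro-batch covers 2 seconds of input.
        JavaStreamingContext ssc = new JavaStreamingContext(conf, Durations.seconds(2));

        // Hypothetical text source; each line is treated as one event.
        JavaDStream<String> lines = ssc.socketTextStream("localhost", 9999);

        // A 10-second window sliding every 4 seconds, defined over processing time;
        // count() emits the number of events per window.
        JavaDStream<Long> counts =
                lines.window(Durations.seconds(10), Durations.seconds(4)).count();

        counts.print();
        ssc.start();
        ssc.awaitTermination();
    }
}
```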

B. Flink

Apache Flink [4] allows data processing to be conducted within the database by building the analytic logic into the database itself. It has become one of the favored platforms for developing distributed real-time streaming data applications.


Supported Real-time Distributed Features of Flink
• Streams generated by distributed systems and devices do not always arrive in order. Previously, this situation of out-of-order stream arrival had to be managed by the developers of the distributed application. Flink is a pioneering open-source tool that manages out-of-order stream arrival itself.
• A number of useful streaming data APIs for Scala and Java are available in Flink. Flink also supports widely used operators from batch processing APIs as well as stream-oriented operations such as windowing, splitting, and connecting streams (see the sketch below).
• Sessions and unaligned windows are supported by Flink. In many streaming tools, windows are embedded in and integrated with the engine's internal methods; in contrast, Flink decouples windowing from fault tolerance.
• Flink supports consistent state updates in the presence of failures, which is frequently called exactly-once processing. It also facilitates continuous data flow across the defined sources and sinks; for example, it supports steady data flow between Kafka [1] and HDFS [12].
• Hueske et al. [16] found in their investigation that Flink is a low-latency and high-throughput distributed system; they clocked Flink at a rate of 1.5 million events per second per core. Flink allows the latency-throughput trade-off to be tuned.
• Flink can be integrated with HDFS, Kafka, HBase, etc. for taking data as input and output. It also serves as the execution engine for other frameworks, such as Cascading and Google Cloud Dataflow.
• Flink can be executed on several platforms, and running it in a local IDE makes Flink-based application development and debugging easier. Flink supports controlled cyclic dependency graphs at run time, which allows Flink-based applications to implement machine learning algorithms very efficiently; iterative processing is supported natively, yielding higher scalability and performance.
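The following minimal sketch (not from the paper; a socket source on localhost:9999 is assumed) shows the style of Flink's DataStream API with keyed, tumbling windows. For brevity it uses processing-time windows; handling out-of-order arrival as described above would instead use event time and watermarks.

```java
import org.apache.flink.api.common.functions.FlatMapFunction;
import org.apache.flink.api.java.functions.KeySelector;
import org.apache.flink.api.java.tuple.Tuple2;
import org.apache.flink.streaming.api.datastream.DataStream;
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;
import org.apache.flink.streaming.api.windowing.assigners.TumblingProcessingTimeWindows;
import org.apache.flink.streaming.api.windowing.time.Time;
import org.apache.flink.util.Collector;

public class FlinkWindowSketch {
    public static void main(String[] args) throws Exception {
        StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();

        // Hypothetical text source; Kafka or HDFS sources would be wired in the same way.
        DataStream<String> lines = env.socketTextStream("localhost", 9999);

        // Split each line into (word, 1) pairs.
        DataStream<Tuple2<String, Integer>> words = lines.flatMap(
                new FlatMapFunction<String, Tuple2<String, Integer>>() {
                    @Override
                    public void flatMap(String line, Collector<Tuple2<String, Integer>> out) {
                        for (String w : line.toLowerCase().split("\\s+")) {
                            if (!w.isEmpty()) {
                                out.collect(Tuple2.of(w, 1));
                            }
                        }
                    }
                });

        // Key by word and count occurrences in 10-second tumbling processing-time windows.
        DataStream<Tuple2<String, Integer>> counts = words
                .keyBy(new KeySelector<Tuple2<String, Integer>, String>() {
                    @Override
                    public String getKey(Tuple2<String, Integer> t) {
                        return t.f0;
                    }
                })
                .window(TumblingProcessingTimeWindows.of(Time.seconds(10)))
                .sum(1);

        counts.print();
        env.execute("flink-window-sketch");
    }
}
```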

Area of Improvement in Flink
• The ability to combine historical data with real-time streaming data is important, but Flink does not have an API that shares the same abstraction for handling historical data and streaming data.
• Flink provides the Table API as a DataFrame-like DSL, but it is not yet mature enough.
• As of now, Flink depends significantly on the MapReduce input formats for integrating data sources. Therefore, data source integration in Flink is limited.
• Flink itself is implemented in Java and, besides Java, offers only Scala APIs for application development. Thus, the choice of programming language in Flink is limited.

C. Storm

Apache Storm [13] represents a real-time computation as a graph called a topology. This graph is submitted to a cluster of nodes, which are classified as master nodes and worker nodes; master nodes distribute the code among the worker nodes for execution. Storm is useful as a fast event processing system that permits incremental computation.

Supported Real-time Distributed Features of Storm
• Storm operates in a distributed manner; data is passed along the connections between the nodes (see the sketch below).
• Storm is suitable for high-speed event processing systems that require incremental computation.
• Since Storm uses Apache Thrift, topologies can be written in any language, such as Java, Python, Scala, or Perl.
• Storm is stateless.
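The following hedged sketch (not from the paper) shows the basic shape of a Storm topology: a spout feeding a bolt through shuffle grouping, run on an in-process LocalCluster. Package names assume Storm 1.x/2.x (org.apache.storm); older releases used backtype.storm. The word spout is a toy stand-in for a real source such as Kafka.

```java
import java.util.Map;

import org.apache.storm.Config;
import org.apache.storm.LocalCluster;
import org.apache.storm.spout.SpoutOutputCollector;
import org.apache.storm.task.TopologyContext;
import org.apache.storm.topology.BasicOutputCollector;
import org.apache.storm.topology.OutputFieldsDeclarer;
import org.apache.storm.topology.TopologyBuilder;
import org.apache.storm.topology.base.BaseBasicBolt;
import org.apache.storm.topology.base.BaseRichSpout;
import org.apache.storm.tuple.Fields;
import org.apache.storm.tuple.Tuple;
import org.apache.storm.tuple.Values;

public class StormTopologySketch {

    // Toy spout: emits one word per call to nextTuple(); a real topology would read from Kafka etc.
    public static class WordSpout extends BaseRichSpout {
        private SpoutOutputCollector collector;
        private final String[] words = {"storm", "stream", "tuple"};
        private int index = 0;

        @Override
        public void open(Map conf, TopologyContext context, SpoutOutputCollector collector) {
            this.collector = collector;
        }

        @Override
        public void nextTuple() {
            collector.emit(new Values(words[index++ % words.length]));
        }

        @Override
        public void declareOutputFields(OutputFieldsDeclarer declarer) {
            declarer.declare(new Fields("word"));
        }
    }

    // Stateless bolt: simply prints each received word.
    public static class PrintBolt extends BaseBasicBolt {
        @Override
        public void execute(Tuple tuple, BasicOutputCollector collector) {
            System.out.println("received: " + tuple.getStringByField("word"));
        }

        @Override
        public void declareOutputFields(OutputFieldsDeclarer declarer) {
            // Terminal bolt: no output fields to declare.
        }
    }

    public static void main(String[] args) throws Exception {
        TopologyBuilder builder = new TopologyBuilder();
        builder.setSpout("words", new WordSpout(), 1);
        builder.setBolt("printer", new PrintBolt(), 2).shuffleGrouping("words");

        // Local, in-process run for illustration; StormSubmitter would deploy to a real cluster.
        LocalCluster cluster = new LocalCluster();
        cluster.submitTopology("sketch", new Config(), builder.createTopology());
        Thread.sleep(10_000);
        cluster.shutdown();
    }
}
```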

Area of Improvement in Storm
• A durable data source is required for exactly-once processing.
• A reliable data source is required for at-least-once processing.
• For consistent state and/or exactly-once delivery, the higher-level Trident API is required.
• In order to offer additional guarantees, an unreliable data source has to be wrapped.

D. Apache Samza

Samza [3] is suitable for systems that have to deal with large amounts of state.

Supported Real-time Distributed Features of Samza
• Samza's stream primitive is a message rather than a tuple. Samza divides streams into partitions; an individual partition is an ordered sequence of read-only messages, each with a distinct ID (see the sketch below).
• Both the execution and the streaming modules of Samza are pluggable, although Samza depends on YARN.
• Like Storm-based applications, Samza-based applications need storage for maintaining large amounts of state while processing arriving tuples; Samza-based applications use a local disk-based key/value store to carry out such tasks.
• In order to deal with large amounts of state on machines with limited memory, Samza co-locates storage and processing tasks on the same machine. Samza also provides the flexibility of pluggable APIs.
• Additionally, Samza can integrate modules written in multiple languages, so modules developed in different languages and by several developers can be added or removed easily.
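The message-at-a-time model can be illustrated with Samza's low-level StreamTask API (the classic 0.x/1.x API). The sketch below is a hypothetical pass-through task: the system name "kafka" and the stream name "processed-events" are placeholders that would come from the job's .properties configuration, which (together with the YARN or local runner) is omitted here.

```java
import org.apache.samza.system.IncomingMessageEnvelope;
import org.apache.samza.system.OutgoingMessageEnvelope;
import org.apache.samza.system.SystemStream;
import org.apache.samza.task.MessageCollector;
import org.apache.samza.task.StreamTask;
import org.apache.samza.task.TaskCoordinator;

// Minimal Samza StreamTask: process() is invoked once per incoming message of a partition,
// and messages within a partition are delivered in order.
public class PassThroughTask implements StreamTask {

    // Hypothetical output stream on the "kafka" system, named in the job configuration.
    private static final SystemStream OUTPUT = new SystemStream("kafka", "processed-events");

    @Override
    public void process(IncomingMessageEnvelope envelope,
                        MessageCollector collector,
                        TaskCoordinator coordinator) {
        // Forward the message unchanged; a real task would transform it or update local state here.
        Object message = envelope.getMessage();
        collector.send(new OutgoingMessageEnvelope(OUTPUT, message));
    }
}
```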

Area of Improvement in Samza
• Apache Samza supports only JVM-based languages; currently it supports Scala and Java.
• State access in Samza is not transparent.
• Exactly-once semantics are still a requirement in Samza.

Table I below presents a summary of the most popular RTDA frameworks and a comparison among them.


TABLE I. COMPARISON OF FOUR POPULAR RTDA TOOLS

Evaluation Metrics                      | Spark            | Flink            | Storm            | Samza
Real-time vs Near Real-time             | Near real-time   | Real-time        | Real-time        | Real-time
Combining Ability with Historical Data  | Supported        | Supported        | Limited support  | Supported
Stream Computation                      | Multi-pass       | Multi-pass       | One-pass         | One-pass
Primitives of Data Sharing              | Supported        | Supported        | Supported        | Supported
Flexible Time Windowing                 | Supported        | Highly supported | Limited support  | Limited support
API Support                             | Supported        | Supported        | Supported        | Supported
Standard Library for Big Data           | Highly supported | Highly supported | Supported        | Supported
Handling Stream Imperfections           | Highly supported | Supported        | Supported        | Highly supported
Contributing Community                  | Large            | Medium           | Medium           | Small

IV. DISCUSSION

Several open-source frameworks have already been developed for real-time Big Data analytics. However, these frameworks are not self-sufficient to solve all kinds of data analytical jobs. For this reason, it is important to conceptualize and develop a proper benchmark approach that contains the right information for finding the appropriate RTDA framework. Referring back to the previous section, Table I provides important information for gaining insights into existing RTDA approaches. Batch or historical data analytics fundamentally differs from real-time data analytics. Batch data analytics is not covered in this paper; rather, the focus remains on real-time analytics by providing an in-depth overview of evolving RTDA frameworks. The initial results presented in this paper are based on a qualitative study. We argue that performing a complementary quantitative study with a similar objective would be highly valuable. Within the scope of this paper, however, the results can be seen as a starting set of recommendations for researchers and analytics professionals that can guide the strategies for their real-time data analytics platforms.

V. CONCLUSION AND COMING EFFORTS

Distributed real-time data analytics has emerged as a hot research issue for academia and industry, as well as for the public sector. Many individuals and organizations are contributing to the frameworks discussed in this paper. Our investigation found that no individual framework or tool yet provides enough functionality to support all features for real-time Big Data analysis or for developing distributed real-time data analytics applications. Therefore, more research and development efforts are needed to improve these frameworks. One future direction related to the ideas presented in this paper is to carry out a quantitative study comparing the different RTDA frameworks and their performance. We have also identified areas of improvement in these frameworks; therefore, another direction of our future work is to contribute to improving some of these open-source tools.


REFERENCES

[1] N. Garg, "Apache Kafka," Packt Publishing Ltd., 2013.
[2] R. Kitchin, "The real-time city? Big data and smart urbanism," GeoJournal, vol. 79, no. 1, 2014, pp. 1-14.
[3] Apache Samza, http://samza.apache.org/startup/, last accessed on 02.12.2015.
[4] Apache Flink, https://flink.apache.org/material.html, last accessed on 28.12.2015.
[5] D. A. Nichols, P. Curtis, M. Dixon, and J. Lamping, "High-latency, low-bandwidth windowing in the Jupiter collaboration system," in Proceedings of the 8th Annual ACM Symposium on User Interface and Software Technology (UIST '95), ACM, New York, NY, USA, pp. 111-120.
[6] T. Siciliani, "Streaming Big Data: Storm, Spark and Samza," https://dzone.com/articles/streaming-big-data-storm-spark, last accessed on 26.12.2015.
[7] A. Das Sarma, A. Lall, D. Nanongkai, and J. Xu, "Randomized multi-pass streaming skyline algorithms," Proc. VLDB Endow., vol. 2, no. 1, August 2009, pp. 85-96.
[8] S. Kulkarni, N. Bhagat, M. Fu, V. Kedigehalli, C. Kellogg, S. Mittal, J. Patel, K. Ramasamy, and S. Taneja, "Twitter Heron: Stream processing at scale," in Proceedings of the 2015 ACM SIGMOD International Conference on Management of Data, ACM, New York, NY, USA, pp. 239-250.
[9] M. Stonebraker, U. Çetintemel, and S. Zdonik, "The 8 requirements of real-time stream processing," ACM SIGMOD Record, vol. 34, no. 4, Dec. 2005, pp. 42-47.
[10] X. Gao, E. Ferrara, and J. Qiu, "Parallel clustering of high-dimensional social media data streams," in 15th IEEE/ACM International Symposium on Cluster, Cloud and Grid Computing (CCGrid), 4-7 May 2015, pp. 323-332.
[11] A. Das, J. Gehrke, and M. Riedewald, "Approximate join processing over data streams," in ACM SIGMOD Conference, June 2003.
[12] S. Kamburugamuve, "Survey of distributed stream processing for large stream sources," Technical report, http://grids.ucs.indiana.edu/ptliupages/publications/survey_stream_processing.pdf, 2013.
[13] N. Marz, "Storm: Distributed and fault-tolerant realtime computation," http://storm-project.net/, Feb 2013.
[14] Apache Spark, http://spark.apache.org/, last accessed on 08.12.2015.
[15] The Hadoop Distributed File System: Architecture and Design, http://hadoop.apache.org/docs/r2.5.1/hadoop-project-dist/hadoop-hdfs/HdfsDesign.html, last accessed on 02.01.2016.
[16] F. Hueske, M. Peters, M. J. Sax, A. Rheinländer, R. Bergmann, A. Krettek, and K. Tzoumas, "Opening the black boxes in data flow optimization," Proceedings of the VLDB Endowment, vol. 5, no. 11, July 2012, pp. 1256-1267.
[17] A. Kejariwal, S. Kulkarni, and K. Ramasamy, "Real time analytics: Algorithms and systems," Proc. VLDB Endow., vol. 8, no. 12, August 2015, pp. 2040-2041.
