
2016 International Conference on Advanced Communication Control and Computing Technologies (ICACCCT)

Real Time Streaming Data Storage and Processing using Storm and Analytics with Hive

D. Surekha1, G. Swamy2, Venkatramaphanikumar S3
1,3 VFSTR University, Department of CSE, Vadlamudi, Guntur, Andhra Pradesh, India, 522213
2 Big Data Practice Lead, Anblicks Solutions
e-mail: [email protected]; [email protected]; [email protected]

Abstract - In the big data world, the Hadoop Distributed File System (HDFS) is one of the best-known file systems for storing huge volumes of data; it takes care of managing and maintaining the data in a distributed way. Based on our research, we discuss how real-time streaming data can be processed and stored into MongoDB and Hive. Big data analytics can be performed on data stored in HDFS using Apache Hive, Tez, and Apache Presto. Hive is an ecosystem component that sits on top of Hadoop (MapReduce) and provides a higher-level language that uses Hadoop's core MapReduce component to process the data. The key benefits of this approach are that it can store and process large amounts of data, handle millions of user requests concurrently, and scale by adding new nodes. Integrating visualization tools with big data applications gives users the big picture of the insights in the data and provides analytical reports about the system.

Keywords - HDFS, Hive, Presto, Real-time streaming, Big data analytics, Visualization.

I. INTRODUCTION

"Big data" refers to datasets whose size is beyond the ability of typical database software tools to capture, store, manage, and analyse [15]. For handling big data, Hadoop is the best solution. Hadoop is an open-source framework to store and process huge amounts of data [1]; it became a top-level Apache project in 2008. The beauty of the framework is that it stores and processes the data in a distributed way. In its early days Hadoop supported only batch processing, since the data is stored sequentially. The core components are HDFS and MapReduce: HDFS for storage and MapReduce for processing. Because it is complex to write MapReduce programs in Java and other languages to process the data, other ecosystem components were invented to make it simpler. Among these, Hive and Pig are the best-known ecosystems for shielding developers from MapReduce: each has its own compiler that converts scripts written in HQL or Pig Latin into MapReduce jobs, which are submitted internally to MapReduce. Hadoop became famous because it can process all kinds of data: structured, semi-structured, and unstructured. Figure 1 illustrates [2] the layers found in the software architecture of a Hadoop stack [10][11].


Looking at the bottom of Figure 1, there is a layer for HDFS, which takes care of storing and managing all kinds of data. On top of HDFS there is a layer called MapReduce, which internally has two phases. The map phase segregates the input data read from HDFS, and its results are written to the local file system. The second phase, called reduce [14], aggregates the results of the map phase. This satisfies the analytics requirements of batch processing [7][9]. For random-access support we can use HBase.
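To make the two phases concrete, the classic word-count job is sketched below using Hadoop's Java MapReduce API. It is an illustration rather than part of the system described in this paper, and the driver (job submission) code is omitted for brevity.

// Illustrative word-count sketch showing the map and reduce phases described above.
import java.io.IOException;
import java.util.StringTokenizer;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;

public class WordCount {

  // Map phase: segregate the input into (word, 1) pairs.
  public static class TokenizerMapper extends Mapper<Object, Text, Text, IntWritable> {
    private static final IntWritable ONE = new IntWritable(1);
    private final Text word = new Text();

    @Override
    protected void map(Object key, Text value, Context context)
        throws IOException, InterruptedException {
      StringTokenizer itr = new StringTokenizer(value.toString());
      while (itr.hasMoreTokens()) {
        word.set(itr.nextToken());
        context.write(word, ONE);
      }
    }
  }

  // Reduce phase: aggregate the counts produced by the map phase.
  public static class IntSumReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
    @Override
    protected void reduce(Text key, Iterable<IntWritable> values, Context context)
        throws IOException, InterruptedException {
      int sum = 0;
      for (IntWritable v : values) {
        sum += v.get();
      }
      context.write(key, new IntWritable(sum));
    }
  }
}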

Fig 1: Hadoop Architecture Layers

HBase can be used by applications to read and write data. Above the MapReduce layer we see Hive and Pig, and above Hive and Pig sit the ETL tools and BI reporting components. The intention of the ETL layer is to use Hadoop (Hive) as the warehouse, which saves time from an ETL point of view, and there is no need to purge data since Hadoop does not have such storage limitations. BI reporting helps to visualize the processed data from Hive or HBase. Figure 2 describes how advanced analytics can be performed using Hadoop and its ecosystem [2]. Advanced analytics includes machine learning, building recommendation systems, and prediction algorithms [4][8].

Fig 2: Hadoop Architecture Tools and Usage

II. BACKGROUND


Processing real-time streaming data is not an easy task [3]. Hadoop can take care of big data with respect to volume, variety, and velocity, and real-time streaming data falls under the velocity category. Here we have three challenges:
1. Getting the real-time data into the Hadoop environment, since millions of events arrive per second.
2. Processing these events in parallel in near real time, which is really complex.
3. Getting something useful out of the processing, i.e. event correlation, which is also a complex task for real-time streaming data.

Fig 3: Real Time Streaming Process

Figure 3 depicts the real-time streaming process. The sources are mobile and web applications, which emit events. All of these events land in a distributed queue, which solves the problem of receiving millions of events in real time; in our research we used Kafka as the message queue. The second challenge is handled by Storm, which receives the real-time data from Kafka. Storm has two components: the spout, which pulls the data from a Kafka topic and makes it available to the bolts, and the bolt, which holds the actual business logic, computes over the data, and stores the results into the target systems. The third challenge, event correlation, can be handled by Storm itself, since it can interact with any kind of database or data store to compare events. Here, after event correlation, we used Hive and MongoDB as the target systems so that the results can be fed to them.

To perform our research on real-time streaming data we built a 20-node cluster using Hortonworks HDP 2.3; the installation was performed using the Ambari server. The services included in the installation are HDFS, MapReduce2, YARN, Hive, ZooKeeper, Storm, Kafka, Presto, and Tez. In industry, Kafka and Storm are the best combination for processing real-time data. We wrote a Kafka producer to write data from the user interface into a Kafka topic; first, the topics have to be created in Kafka. One might ask why we chose Kafka when other messaging queues are available. The simple answer is that Kafka is a fast, scalable, and reliable messaging system [1], and it is better with respect to throughput, replication, and fault tolerance than other messaging queues.

A. Distributed Messaging Queue Landscape
Kafka was implemented by LinkedIn and donated to Apache [17], which is why it is called Apache Kafka. A message queuing system has three components: producer, broker, and consumer. All the traditional messaging queues were developed on the point-to-point paradigm. The limitation here is that

when the producer pushes messages to the broker, only one consumer receives the data, even though multiple consumers are registered with the broker. Kafka, however, is built on the publish/subscribe (producer-consumer) paradigm: when the producer writes a message to the broker, a copy is sent to all the consumers. This strategy is more efficient than point-to-point.

B. Message Queuing System: The Kafka Approach

Figure 4 depicts in detail the way multiple producers, multiple consumers, and the broker communicate. Kafka works in a distributed way: the three Kafka roles (producer, broker, consumer) work as a logical group, and these services can be installed across the nodes of the cluster. In place of the consumers we have different systems, which also form a logical group, so a single copy of each message is kept in the Kafka broker and all the consumers can read that single copy. A topic has two internal files: an index file and a log file. The log file is fed by the producer as messages are delivered: each message is appended at the end of the topic, and based on a certain time interval or number of messages the accumulated messages are flushed to the topic on the broker. Once the messages have been flushed, they are exposed to the consumer.
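To make the producer side concrete, a minimal sketch of a Java producer that pushes user-interface events into a Kafka topic is given below. It is illustrative only: the broker address, topic name, and event payload are assumptions, not the actual ones used in our system.

// Hedged sketch: a minimal Kafka producer pushing user-interface events into a topic.
// Broker address, topic name, and payload format are assumptions.
import java.util.Properties;
import org.apache.kafka.clients.producer.KafkaProducer;
import org.apache.kafka.clients.producer.Producer;
import org.apache.kafka.clients.producer.ProducerRecord;

public class EventProducer {
  public static void main(String[] args) {
    Properties props = new Properties();
    props.put("bootstrap.servers", "kafka-broker-1:9092");   // assumed broker address
    props.put("key.serializer",
        "org.apache.kafka.common.serialization.StringSerializer");
    props.put("value.serializer",
        "org.apache.kafka.common.serialization.StringSerializer");

    Producer<String, String> producer = new KafkaProducer<>(props);
    // Each event is appended to the end of the topic's log, as described above.
    producer.send(new ProducerRecord<>("user-events", "device-123",
        "{\"action\":\"scan\",\"product\":\"P-42\"}"));
    producer.close();
  }
}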

Fig 4: Communication between the Producers and Consumers

Because of the above approach, every message is reliably processed and the system can handle millions of records per second. Now we can look at the architecture of Storm.

C. Real-Time Data Processing System: Storm
Apache Storm [18] has the ability to work on different use cases. We are mainly interested in real-time streaming; beyond that, it is also good at machine learning, distributed RPC, and ETL. Storm can process a million records per second on a single node, we can scale the cluster by adding new slave nodes, and it is fault tolerant. In a Storm cluster we run topologies. The input to Storm is a stream, which is nothing but an unbounded sequence of tuples, as shown in Figure 5.
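As an illustration of how such a stream is produced, a minimal spout sketch is given below. It is not the spout used in this work; the class and field names are invented for illustration, and the backtype.storm package prefix corresponds to the Storm 0.x releases shipped with HDP 2.3 (newer releases use org.apache.storm).

// Illustrative spout: emits an unbounded stream of tuples.
import java.util.Map;
import backtype.storm.spout.SpoutOutputCollector;
import backtype.storm.task.TopologyContext;
import backtype.storm.topology.OutputFieldsDeclarer;
import backtype.storm.topology.base.BaseRichSpout;
import backtype.storm.tuple.Fields;
import backtype.storm.tuple.Values;

public class RandomEventSpout extends BaseRichSpout {
  private SpoutOutputCollector collector;

  @Override
  public void open(Map conf, TopologyContext context, SpoutOutputCollector collector) {
    this.collector = collector;
  }

  @Override
  public void nextTuple() {
    // Called repeatedly by Storm; each call may emit the next tuple of the stream.
    collector.emit(new Values(System.currentTimeMillis(), "sample-event"));
  }

  @Override
  public void declareOutputFields(OutputFieldsDeclarer declarer) {
    declarer.declare(new Fields("timestamp", "event"));
  }
}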

Fig 5: Unbounded Sequence of Tuples


Storm provides primitives for transforming one stream into a new stream in a distributed and reliable way. The figure below shows a combination of spouts and bolts: a spout sends data to multiple bolts, and a bolt can receive data from multiple bolts, which means a topology can contain multiple, interrelated bolts.
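A bolt is defined in a similar way to the spout sketched earlier. The sketch below is again purely illustrative (not one of the bolts used in this work): it simply transforms each incoming tuple and emits it onto a new stream.

// Illustrative bolt: consumes tuples, applies some logic, and emits a new stream.
import java.util.Map;
import backtype.storm.task.OutputCollector;
import backtype.storm.task.TopologyContext;
import backtype.storm.topology.OutputFieldsDeclarer;
import backtype.storm.topology.base.BaseRichBolt;
import backtype.storm.tuple.Fields;
import backtype.storm.tuple.Tuple;
import backtype.storm.tuple.Values;

public class UppercaseBolt extends BaseRichBolt {
  private OutputCollector collector;

  @Override
  public void prepare(Map conf, TopologyContext context, OutputCollector collector) {
    this.collector = collector;
  }

  @Override
  public void execute(Tuple input) {
    // The business logic lives here; this example just transforms the event field.
    String event = input.getStringByField("event");
    collector.emit(new Values(event.toUpperCase()));
    collector.ack(input);
  }

  @Override
  public void declareOutputFields(OutputFieldsDeclarer declarer) {
    declarer.declare(new Fields("event"));
  }
}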


Fig 8: Storm Structure

Fig 6: Spout, which emits tuple stream

Figure 7 shows how bolts process the tuples and emit a new stream. The supervisors in Storm support parallel processing, and we can specify the parallelism in the configuration files. There are three high-level entities in a Storm cluster: (i) worker processes, (ii) executors, and (iii) tasks.

Here, the blue spout is the origin of the tuple stream, with its parallelism hint set to two. Figure 9 is the blueprint of the practical view of the topology. The blue spout is connected to the green bolt; the spout has a parallelism hint of 2 and 2 tasks, the green bolt has a parallelism hint of 2 and 4 tasks, and the green bolt is connected to the yellow bolt, which has a parallelism hint of 6 and 6 tasks. The basic concept of Storm is processing the stream of tuples in parallel; we have to describe and implement our own bolts and spouts, and for better performance we have to define the parallelism across all the nodes, which is why we go into detail about setting up parallelism here. Once the setup is done and the Storm topology is defined, the next step is to read the data from Kafka using the spout. As mentioned earlier, Kafka can handle millions of messages and is highly reliable. We used one bolt to write into Hive and another to write into MongoDB.
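A hedged sketch of how such a topology might be wired up is shown below. HiveWriterBolt and MongoWriterBolt are hypothetical placeholders for the two bolts described above, and the ZooKeeper address, topic name, and parallelism values are assumptions rather than the actual settings used in our cluster.

// Hedged sketch: a Kafka spout feeding one bolt that writes to Hive and one that writes to MongoDB.
// HiveWriterBolt and MongoWriterBolt are hypothetical placeholders.
import backtype.storm.Config;
import backtype.storm.StormSubmitter;
import backtype.storm.spout.SchemeAsMultiScheme;
import backtype.storm.topology.TopologyBuilder;
import storm.kafka.BrokerHosts;
import storm.kafka.KafkaSpout;
import storm.kafka.SpoutConfig;
import storm.kafka.StringScheme;
import storm.kafka.ZkHosts;

public class StreamingTopology {
  public static void main(String[] args) throws Exception {
    // The Kafka spout reads the raw events that the producer wrote into the topic.
    BrokerHosts zk = new ZkHosts("zookeeper-1:2181");   // assumed ZooKeeper address
    SpoutConfig spoutConf = new SpoutConfig(zk, "user-events", "/user-events", "storm-consumer");
    spoutConf.scheme = new SchemeAsMultiScheme(new StringScheme());

    TopologyBuilder builder = new TopologyBuilder();
    builder.setSpout("kafka-spout", new KafkaSpout(spoutConf), 2);
    builder.setBolt("hive-bolt", new HiveWriterBolt(), 2).shuffleGrouping("kafka-spout");
    builder.setBolt("mongo-bolt", new MongoWriterBolt(), 2).shuffleGrouping("kafka-spout");

    Config conf = new Config();
    conf.setNumWorkers(2);
    StormSubmitter.submitTopology("streaming-topology", conf, builder.createTopology());
  }
}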

Fig 7: Bolts process the tuples and emit a new stream

A worker process executes a subset of the topology [16] and can run one or more executors; there may be many worker processes running for one topology. An executor is a thread spawned by a worker process, and it runs tasks. The task performs the actual processing, i.e. each of the spouts and bolts in the cluster. The following listing configures the parallelism of a simple Storm topology:

Config conf = new Config();
conf.setNumWorkers(2);                                         // use two worker processes
TopologyBuilder topologyBuilder = new TopologyBuilder();
topologyBuilder.setSpout("blue-spout", new BlueSpout(), 2);    // parallelism hint
topologyBuilder.setBolt("green-bolt", new GreenBolt(), 2)
               .setNumTasks(4)
               .shuffleGrouping("blue-spout");
topologyBuilder.setBolt("yellow-bolt", new YellowBolt(), 6)
               .shuffleGrouping("green-bolt");
StormSubmitter.submitTopology("mytopology", conf, topologyBuilder.createTopology());

Fig 9: Depiction of the total number of workers, executors, and tasks in the topology

III. BIG DATA ANALYSIS USING HIVE
Hive is the data warehouse system for Hadoop. It runs SQL-like queries that are compiled and executed as MapReduce jobs, and it displays the results back to the user. Data in Hadoop, even though generally unstructured, has some vague structure associated with it. The reason for


using Hive is to ease the use of the Hadoop file system and MapReduce for non-developers: users such as scientists and analysts just need to know SQL syntax, and writing SQL is faster than writing code. Some of the Hive features are listed below:
- CREATE TABLE, CREATE VIEW, CREATE INDEX (DDL)
- SELECT, WHERE clause, GROUP BY, ORDER BY, and joins (DML)
- Pluggable input/output formats
- Pluggable user-defined functions (UDFs), user-defined aggregate functions (UDAFs), and user-defined table functions (UDTFs)
- Pluggable serializer/deserializer (SerDe) libraries
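As an example of the pluggable UDF mechanism, a minimal Hive UDF written against the classic org.apache.hadoop.hive.ql.exec.UDF API is sketched below; the class name and logic are illustrative only, not functions used in this work.

// Illustrative Hive UDF: Hive calls evaluate() once per row.
import org.apache.hadoop.hive.ql.exec.UDF;
import org.apache.hadoop.io.Text;

public class LowerCaseUDF extends UDF {
  public Text evaluate(Text input) {
    if (input == null) {
      return null;            // returning null for null input is the usual convention
    }
    return new Text(input.toString().toLowerCase());
  }
}

After packaging such a class into a jar, it would typically be registered with ADD JAR and CREATE TEMPORARY FUNCTION before being used in queries.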


A. Configuring Hive
Since Hive depends on the HDFS and MapReduce components, we must configure the IP addresses and port numbers of HDFS and MapReduce in the Hive configuration file, called hive-site.xml. How this file is updated depends on the distribution being used: with Hortonworks we can update it through the Ambari server, or we can override values from the command line using the -hiveconf parameter. Hive scripts can be run in three different ways: hive -f when the script is in a file, hive -e to run queries directly at the command line, and interactively from the Hive shell.

Fig 10: Product Scanned Details

B. Hive Services
We can access Hive through JDBC, ODBC, and Thrift clients; the web interface is another way to access Hive. The Hive service can also be used to run the hadoop command with the jar option, the same as you could do directly but with the Hive jars on the classpath. Lastly, there is a service for an out-of-process metastore. The metastore stores the Hive metadata, and there are three configurations to choose from for the metastore setup. Embedded is the first option: the metastore is tied to the Hive process being run. The second option is to run it as local, which keeps the metastore code running in-process but moves the database into a separate process that the metastore code communicates with. The third option, a remote metastore, is the one to choose when the metastore has to be shared with external users. Hive [13] is a data warehousing solution developed by Apache Software [5]. Here we wrote Hive scripts, i.e. queries over the data in HDFS, and wrote the results into Hive tables; the results are summarized or analytical data. At this point the data is in Hive tables, and Hive can be configured with Presto to give faster responses. From the analytical data in the Hive tables we generate reports. For visualization we use SpagoBI: we were looking for an open-source BI tool, and SpagoBI was the best option we found since it is lightweight and efficient enough to handle a decent amount of data from Hive. We configured SpagoBI with Hive using a JDBC connection, built the SpagoBI queries to pull the data from the Hive tables, created the reports in SpagoBI Studio, and deployed them to the SpagoBI server.
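As a sketch of the JDBC access path mentioned above (the same path SpagoBI uses), the following example queries HiveServer2 from Java; the host name, credentials, and table name are assumptions, not the actual ones used in our setup.

// Hedged sketch: querying Hive over JDBC (HiveServer2). Host, credentials, and table are assumed.
import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.ResultSet;
import java.sql.Statement;

public class HiveJdbcExample {
  public static void main(String[] args) throws Exception {
    Class.forName("org.apache.hive.jdbc.HiveDriver");
    Connection con = DriverManager.getConnection(
        "jdbc:hive2://hive-server:10000/default", "hive", "");   // assumed host and credentials
    Statement stmt = con.createStatement();
    // Example analytical query over a hypothetical summary table.
    ResultSet rs = stmt.executeQuery(
        "SELECT product_name, COUNT(*) AS scans FROM scanned_products GROUP BY product_name");
    while (rs.next()) {
      System.out.println(rs.getString(1) + "\t" + rs.getLong(2));
    }
    rs.close();
    stmt.close();
    con.close();
  }
}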

Fig 11: Total Scanned Items

Fig 12: Most Scanned Products


For this we collect input parameters such as product name, from-date, to-date, location, and tag type from the user interface. Based on these parameters we get the visualizations from SpagoBI. Analytical reports for use cases such as product scan details, most scanned products, and product success/fail rate are shown in Fig 10 through Fig 12.

IV. CONCLUSION
Processing huge volumes of real-time data is never an easy job. The best ecosystem in the Hadoop world for this is Apache Storm, and the best partner to Apache Storm is Kafka as the message queuing system; it is reliable, scalable, and works in a distributed way. Once the data is in HDFS and Hive, Tez helps internally to optimize the DAGs of MapReduce jobs so that batch processing is efficient, and for distributed querying Presto is the best combination with Hive. As mentioned, after processing, the results can be visualized using SpagoBI, which was chosen because it is open source.

REFERENCES

[1] http://www.hortonworks.com
[2] http://www.ijrise.net
[3] http://www.happiestminds.com
[4] Dunren Che, Mejdl Safran, and Zhiyong Peng, "From Big Data to Big Data Mining: Challenges, Issues, and Opportunities," DASFAA Workshops 2013, LNCS 7827, pp. 1-15, 2013.
[5] Anja Gruenheid, Edward Omiecinski, and Leo Mark, "Query Optimization Using Column Statistics in Hive," Proceedings of the 15th Symposium on International Database Engineering & Applications (IDEAS '11), 2011.
[6] http://www.adhocshare.tk
[7] Jeffrey Dean and Sanjay Ghemawat, "MapReduce: A Flexible Data Processing Tool," Communications of the ACM, Vol. 53, No. 1, pp. 72-77, January 2010.
[8] Carlos Ordonez, "Algorithms and Optimizations for Big Data Analytics: Cubes," Tech Talks, University of Houston, USA.
[9] Jeffrey Dean and Sanjay Ghemawat, "MapReduce: Simplified Data Processing on Large Clusters," Communications of the ACM, Vol. 51, pp. 107-113, 2008.
[10] Hadoop, "Powered by Hadoop," http://wiki.apache.org/hadoop/PoweredBy
[11] Apache Hadoop, http://hadoop.apache.org
[12] Apache Hive, http://hive.apache.org/
[13] The Apache Software Foundation, Hive, http://hive.apache.org/
[14] Kyuseok Shim, "MapReduce Algorithms for Big Data Analysis," DNIS 2013, LNCS 7813, pp. 44-48, 2013.
[15] Brad Brown, Michael Chui, and James Manyika, "Are You Ready for the Era of Big Data?," McKinsey Quarterly, McKinsey Global Institute, October 2011.
[16] http://www.michael-noll.com/blog/2012/10/16/understanding-the-parallelism-of-a-storm-topology/
[17] http://kafka.apache.org/design.html
[18] http://storm-project.net/

