Fault Tolerant State Management for High-Volume Low-Latency Data Stream Workloads

K. B. Muralidharan, G. Santhosh Kumar
Department of Computer Science, Cochin University of Science and Technology, Kochi, India
[email protected], [email protected]

M. Bhasi
School of Management Studies, Cochin University of Science and Technology, Kochi, India
[email protected]
Abstract— One of the major challenges in performing incremental computations on parallel distributed stream processing systems is the implementation of a mechanism for passing state values across successive runs. One approach is to coarsen the granularity from record-at-a-time processing to processing at the micro-batch level. A contrasting approach is to follow record-at-a-time semantics and ensure scalability by means of distributed state management. Both approaches, however, must provide a high degree of fault tolerance. In this paper, we study the problem of process state management for non-terminating data stream workloads in low-latency computing using the micro-batch stream processing approach. We examine methods that could yield optimum levels of state retention with a high degree of fault tolerance for typical processing workloads, and propose a three-pronged approach to meet these demands.

Keywords— Data stream processing; state management; fault tolerance; micro-batch processing

I. INTRODUCTION
Often it is required to process streams of data that arrive continuously, such as click streams, log data, network traffic, real-time data from social media, and web crawl data. MapReduce (MR) is a good fit for processing massive-scale data owing to its simple programming model and inherently parallel distributed architecture [6]. The availability of elegant MR libraries and utilities makes the choice even more viable. A natural challenge for users who wish to take MR beyond its batch processing use cases is to modify the system [15] to run (i) interactive analyses over massive source data, where early and frequent ballpark estimates are far more useful than a precise result at the end of the process, and (ii) non-terminating data stream sources. Both use cases demand low latencies at the various operating stages of the workflow. It is then clear that the intermediate results (states) must be retained by the system and reloaded for each batch (a single record or a set of records) of the incoming stream [8]. The state may be retained in memory, on disk, or in both. Continuous streams must be equipped with the capability of merging new data into the master dataset, which in the case of HDFS means inducting new files into the master dataset folder using the HDFS APIs [3, 4].
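As a concrete illustration, the ingestion step can be as simple as copying each newly arrived batch file into the master dataset directory. The following is a minimal sketch that shells out to the standard hdfs dfs -put command; the path names and the file name are illustrative assumptions, not taken from the paper.

    import subprocess

    def induct_into_master(local_file: str, master_dir: str = "/data/master") -> None:
        """Induct a newly arrived batch file into the master dataset folder on HDFS."""
        # `hdfs dfs -put` copies the local file into the HDFS directory and fails
        # if a file of the same name already exists, which guards against
        # double ingestion of the same batch.
        subprocess.run(["hdfs", "dfs", "-put", local_file, master_dir], check=True)

    # Example: induct_into_master("clicks-2014-07-21-1100.log")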
It is always a challenging proposition to keep the system going even under failures. Considerable effort and planning must be expended at the application layer, such as managing replicas and managing queues, in order to recover from crashes, not to mention the occasional human-introduced errors that are even harder to tackle. In short, long-running systems place high demands on fault tolerance. In this paper we propose approaches, on the one hand, to minimize state management overheads by carefully choosing strategies based on the use cases to be run against large-scale workloads; complementing these strategies, matching fault-tolerance mechanisms are also suggested. In Section II we outline our approach under three sub-sections.
II. STATE MANAGEMENT FOR STREAMS
We propose a strategy derived from three major aspects that influence the low-latency processing of data streams using the MapReduce framework. As a first step, we categorize the typical stream processing workloads based on the sequencing pattern of the query operators in them and identify the optimal parallelization scheme for each case. Next, we look at a mechanism to retain state across successive runs, minimizing the size of the state information, repartitioning it so that it can be held in memory across a collection of nodes, and combining it downstream along the computation flow. Finally, we devise the recovery mechanisms that need to be put in place to guarantee high degrees of fault tolerance.
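To make the second aspect concrete, the sketch below carries keyed state across successive runs: the state is hash-partitioned across a collection of nodes (modelled here as in-process partitions) and the partial states are combined downstream. The partition count, the counting workload, and all names are illustrative assumptions, not the paper's implementation.

    from collections import defaultdict

    NUM_PARTITIONS = 4  # illustrative: one partition per processing node

    # Per-partition keyed state, carried over between successive runs.
    state = [defaultdict(int) for _ in range(NUM_PARTITIONS)]

    def partition_of(key):
        # Hash partitioning spreads the state across the nodes.
        return hash(key) % NUM_PARTITIONS

    def run_batch(records):
        """One run over one batch of the stream: update partitioned state in place."""
        for key in records:
            state[partition_of(key)][key] += 1

    def combine_downstream():
        """Combine the partial per-partition states along the computation flow."""
        merged = defaultdict(int)
        for part in state:
            for key, count in part.items():
                merged[key] += count
        return dict(merged)

    run_batch(["ad1", "ad2", "ad1"])
    run_batch(["ad1"])             # state from the previous run is reused
    print(combine_downstream())    # e.g. {'ad1': 3, 'ad2': 1}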
A. Query parallelization based on operator-oriented clustering

Some of the popular use cases that run against typical high-volume workloads include frequent word counts, distributed grep, web crawling, access log analysis, deduplication, document clustering, and inverted indexing. The query expressions for these use cases often contain stateless operators (filter, map, union) as well as operators (join, aggregation) that retain state information. Filters enable splitting an input stream into streamlets that satisfy the filter predicates. Map operators are useful for projecting the input schema onto a required output schema.
Flavours of join operators are specifically applied for mixing heterogeneous streams along the processing path. Analytical outputs invariably require applying one or more types of aggregators at later stages of the processing path. It is therefore hypothesized that any successful attempt to distribute the operators across a number of processing nodes (within the map as well as the reduce stages) would pay a huge dividend in terms of scalability and reduced latencies. Obviously, the benefits of distributed processing largely depend on the mutual independence between the chunks of data processed within, as well as between, the operators.

The stateless operators are inherently amenable to higher degrees of distribution. However, the presence of stateful operators brings certain constraints on the distribution plans and their parallel execution within the clusters. Practically, a few distinct cases can be identified among the use cases based on their query patterns, as illustrated below.
Query contains only stateless operator(s): One example use case is inverted indexing, where the map operator emits {(URL, term)} key-value tuples. Subsequently, the filter operator collects all tuples grouped by URL and sends them to some (temporary or persistent) store. Other candidate use cases in the same group are log analysis (of the basic kind), input validators, and grep utilities. The latter members may even be expressed through a solitary type of operator, viz. filters. In order to leverage an optimal set of resources, we could run this type of query on a large number of machines configured as a single cluster, as shown in Fig. 1. Attempts to maximize the utilization of resources on each machine also argue for using a smaller number of clusters.
Fig. 1. Deployment of a query bundle with stateless operators only (nodes 0 ... N-1, each running a Map() » Filter() chain, configured as a single cluster).
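A minimal sketch of such a stateless Map() » Filter() bundle follows. Because the bundle carries no state, any worker can process any record, which is what lets a single large cluster scale out. The operator bodies, the term-length predicate, and the use of a process pool to stand in for cluster nodes are all illustrative assumptions.

    from multiprocessing import Pool

    def map_op(line):
        """Emit (URL, term) tuples for one crawled line of the form 'url<TAB>text'."""
        url, _, text = line.partition("\t")
        return [(url, term) for term in text.split()]

    def filter_op(tuples):
        """Keep only tuples whose term satisfies the filter predicate."""
        return [(url, term) for url, term in tuples if len(term) > 3]

    def bundle(line):
        # The whole bundle is stateless, so any worker may process any record.
        return filter_op(map_op(line))

    if __name__ == "__main__":
        lines = ["http://a.example\tthe quick brown fox",
                 "http://b.example\tjumps over the lazy dog"]
        with Pool(2) as pool:  # one worker per node of the single cluster
            for out in pool.map(bundle, lines):
                print(out)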
Query contains stateless as well as stateful operators: As a comparable case, we look at queries of the simplest kind that one might run to learn how often people who see a certain ad end up clicking it. Such queries clearly require running aggregators in the downstream stages.

We therefore come up with a strategy of sub-dividing the logical query expression into sub-expressions based on the following set of guidelines:
• Each sub-expression contains at most one stateful operator.
• A sub-expression is created with stateless operators alone if and only if there is no further stateful operation to be performed downstream.
• All stateless operators that fall between two stateful operators may be attached to either the left-side or the right-side sub-expression, whichever yields better management of the load incurred.
• A sequence of stateless operators may alone be chosen for a sub-expression, based on load-balancing criteria.
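The sketch below applies these guidelines to the ad-click query (map » filter » join » aggregate). The operator names, the stateful set, and the greedy choice of attaching intermediate stateless operators to the left-side sub-expression are illustrative assumptions, not the paper's algorithm.

    STATEFUL = {"join", "aggregate"}

    def split_query(operators):
        """Split an operator sequence into sub-expressions, each holding at most
        one stateful operator; stateless runs between two stateful operators
        are attached to the sub-expression on their left here."""
        subexprs, current = [], []
        for op in operators:
            if op in STATEFUL and any(o in STATEFUL for o in current):
                # `current` already owns a stateful operator: close it off.
                subexprs.append(current)
                current = []
            current.append(op)
        if current:
            subexprs.append(current)
        return subexprs

    print(split_query(["map", "filter", "join", "aggregate"]))
    # [['map', 'filter', 'join'], ['aggregate']]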
Figure 2 shows the deployment of two such sub-expressions distributed across two appropriately balanced clusters.
[Fig. 2: two sub-expressions, Sub-expr1 and Sub-expr2, each deployed on its own cluster.]
B. Micro-batching

It is commonly understood that record-at-a-time stream processing offers lower latency than processing records in small batches. This property makes record-at-a-time stream processors an automatic choice for applications such as stock trading and fraud alerting. However, we expect that the trade-off between latency and the overhead of recovering from failures ultimately favours approaches that use micro-batched stream inputs. The stream of records is segmented into an ordered sequence of micro-batches, with each batch given a unique ID; the original stream is split into micro-batches grouped by specific attribute(s). Upon failure, only the affected batches need to be replayed. This implies that the amount of state information to be retained by micro-batched processors is considerably less than in the record-at-a-time case. The possibility of holding less state information during the execution cycles enables long-running queries over high-volume data streams to keep all of their state information in memory rather than swapping it in and out of disk. This substantial benefit further supports our choice of sending micro-batched inputs to the processing engine, and the option to retain all state information in memory directly enhances the scalability of the system.
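The following is a minimal sketch of this scheme: the stream is cut into fixed-size batches for simplicity (the paper groups by specific attributes), each batch is stamped with a unique ID, and each batch's input is retained so that only the affected batches are replayed after a failure. The batch size and the in-memory ledger are illustrative assumptions.

    import itertools

    def micro_batches(stream, batch_size=3):
        """Segment a record stream into an ordered sequence of (id, batch) pairs."""
        counter = itertools.count()
        batch = []
        for record in stream:
            batch.append(record)
            if len(batch) == batch_size:
                yield next(counter), batch
                batch = []
        if batch:
            yield next(counter), batch

    ledger = {}  # batch_id -> input records, kept so failed batches can be replayed

    def process(batch_id, records):
        ledger[batch_id] = records
        # ... run the query operators over `records` ...

    for batch_id, records in micro_batches(["r%d" % i for i in range(7)]):
        process(batch_id, records)

    def replay(batch_id):
        """Re-run only the affected batch after a failure, using its retained input."""
        process(batch_id, ledger[batch_id])

    replay(1)  # replays records r3..r5 only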
C. Fault tolerance