CHive: Bandwidth Optimized Continuous Querying in Distributed Clouds

Bart Theeten and Nico Janssens

Abstract—Bandwidth-efficient execution of online big data analytics in telecommunication networks demands tailored solutions. Existing streaming analytics systems are designed to operate in large data centers, assuming unlimited bandwidth between data center nodes. Applying these solutions unmodified to distributed telecommunication clouds overlooks the fact that available bandwidth is a scarce and costly resource, which makes the telecommunication network valuable to end-users. This article presents Continuous Hive (CHive), a streaming analytics platform tailored for distributed telecommunication clouds. The fundamental contribution of CHive is that it optimizes query plans to minimize their overall bandwidth consumption when deployed in a distributed telecommunication cloud. Additionally, these optimized query plans have a high degree of parallelism built in, benefiting speed of execution. Early experiments on data from a large mobile operator indicate that CHive can yield bandwidth reductions upwards of 99%.

Index Terms—Big Data, Cloud, Data processing, Distributed computing, Event Stream Processing, Optimization, Query processing, Telecommunications
1 INTRODUCTION
BIG DATA companies like Facebook, Google, Twitter or LinkedIn continuously collect massive amounts of data from users and their devices. In the business model of these companies, data has become a very valuable asset for extracting knowledge about user interests, social contacts, intentions, etc. – for instance to drive targeted advertisement campaigns, to recommend news items or to present personalized offers.
Traditional telecommunication companies, in contrast, adopt a different business model. These companies generate revenues by selling their premium communication services, including network bandwidth. However, similar to the web-scale companies listed above, telecommunication companies also have access to massive amounts of information, in particular data related to how their networks are being used. This includes both (anonymized) network traffic information and statistics about the operational status of the deployed telecommunication equipment. Processing and mining this data generates valuable insights that enable improving the operation of the affected network, including the ability to predict and prevent erroneous situations (like overloads and traffic congestion), to support dynamic network capacity planning, to perform user and user-device segmentation, and even to predict user behavior.
Existing IT platforms and solutions for big data analytics are designed to operate on large clusters of processing nodes, located in the same data center (DC) [1], [2], [3], [4], [5]. Additionally, these platforms assume the availability of virtually unlimited resources, such as compute power and network bandwidth.
B. Theeten and N. Janssens are with the IP Platforms Scalable Data Department at Bell Labs, Alcatel-Lucent, Antwerp, Belgium (e-mail: [email protected], [email protected]).
When executing big data analytics in telecommunication clouds, however, these assumptions cannot be taken for granted anymore. First, telecommunication clouds tend to be highly distributed in nature, being built up as a constellation of micro DCs in the edge and/or access network. These micro DCs have the unique benefit of being located much closer to the end-user, which enables e.g. hosting lower-latency services and location-aware processes. Second, if the data generation velocity is high and/or the size of the events is large, transporting this data over the network to a central DC may consume a significant portion of the available bandwidth, overlooking that network bandwidth is a scarce and costly resource that makes the telecom network valuable to end-users.
This article presents CHive (Continuous Hive), developed by Alcatel-Lucent Bell Labs, which offers a Hive-like [6] solution to simplify and optimize streaming analytics in telecommunication clouds. Similar to Hive, CHive aims to facilitate the execution of SQL-like queries to process massive datasets. In contrast to Hive, however, CHive is not designed to execute ad-hoc queries on large datasets stored in Hadoop [1], but instead executes continuous queries¹ [7] on data collected in an online fashion. The fundamental contribution of CHive is that it optimizes query plans to minimize their overall bandwidth consumption when deployed in a distributed cloud. This is accomplished by rewriting query plans such that data events can be processed as close as possible to their source, hence limiting the amount of information that needs to be sent all the way down to the network core where the analytics applications are typically running. As an added benefit, the optimized query plans have a high degree of parallelism built-in, benefiting speed of execution.
1. A continuous query is a query that is repeatedly re-evaluated as new data comes in; it operates on a data stream rather than a previously stored database table.
Fig. 1. Distributed telco cloud.
Early experiments on real-life data from a large service provider indicate that CHive can yield bandwidth reductions upwards of 99%.
The remainder of this article is structured as follows. Section 2 provides background information on telco network clouds and evaluates the application of existing solutions facilitating (streaming) analytics in them. Section 3 introduces CHiveQL, the high-level continuous query language. Section 4 provides technical details about the CHive architecture and describes the overall process flow of the query plan compiler. Section 5 evaluates the benefits of the CHive platform, including both a theoretical evaluation and benchmark results using real CDR data of a large service provider. Finally, conclusions and future work are presented in Section 6.
2 BACKGROUND AND RELATED WORK
Telecommunication networks are designed to enable communication between terminals. As illustrated in Figure 1, these networks are often built up in a hierarchical fashion, including access-layer networks connecting to end-user homes and mobile devices, edge-layer networks aggregating traffic from all access-layer nodes, and finally the network core aggregating all traffic from the edge networks. Distributed telco clouds deploy micro DCs at edge and/or access-layer aggregation points, hence providing multiple layers of geographically distributed processing capabilities. This facilitates hosting low-latency services close to the user, as well as location-aware processes. Additionally, it enables reducing the amount of (monitoring) data that needs to flow from the edges of the network to the network core, by pre-processing these streams on DCs located close to the affected data sources, hence exploiting the inherent parallelism in telco data streams. It is this latter bandwidth reduction property that CHive exploits. Bandwidth scarcity is also recognized in IoT (Internet-of-Things) networks in [8], and even within a data center in [9]. Their solutions are, however, application-specific rather than being offered as optimizations of a general-purpose query language. The remainder of this section evaluates the application of existing analytics solutions to process network data.
2.1 Batch Processing
Batch processing tools, such as Hadoop [1] MapReduce and Apache Spark [2], offer a programming model and runtime platform to process large amounts of historically received data. As Dean and Ghemawat state in [10], users specify a map function that processes a key/value pair to generate a set of intermediate key/value pairs, and a reduce function that merges all intermediate values associated with the same intermediate key. Programs written in this functional style are automatically parallelized and executed on a large cluster of commodity machines. Apache Spark [2] is sometimes referred to as an in-memory MapReduce framework that promises 100x better performance than Hadoop. Hourglass, another MapReduce solution, seeks to improve incremental data processing by extending the Hadoop API with an easy accumulator-based interface for the programmer [11].
Since the MapReduce programming model is very low-level, tools like Hive [6], [12], [13] were invented, offering an SQL-like declarative language that is compiled into MapReduce jobs. The popularity of Hive has triggered new work describing how to optimize MapReduce queries. Wu et al. propose in [14] a query optimization scheme to be integrated in Hive, to relieve the data analyst from optimizing queries before submitting them. CHive is situated at the same level as Hive in the data processing toolchain, but targets optimizing event stream processing rather than batch processing.
All solutions presented so far facilitate batch processing of historical data in a single large DC. In order to process data sets that are geographically distributed across DCs, Jayalath et al. [15] propose data transformation graphs for executing a MapReduce job as a sequence of jobs per DC, each DC running its own local MapReduce cluster. CHive offers similar benefits to event stream processing – i.e. it avoids back-hauling all event data to a single DC before processing. More specifically, CHive stores and processes raw events as close as possible to the event sources, requiring only minimal information to be transferred towards the network core running the analytics front-end.
2.2 Event Stream Processing
Event Stream Processing (ESP) systems are fundamentally different from batch processing platforms in that they "store" event processing workflows rather than data. The arrival of new data events triggers the execution of these workflows. Aurora/Borealis [16], [17] is an interesting example of an ESP system. These systems offer a programming model and dedicated runtime to facilitate data stream processing, including primitive building blocks for workflow processing, real-time scheduling support, load shedding features and dedicated storage management. Storm [3] is another well-known example
of a distributed and fault-tolerant real-time ESP middleware. From a very high level, Storm could be compared to Hadoop in terms of the non-functionals it offers, such as cluster management. From a functional perspective, however, Storm is fundamentally different in that it manages topologies of continuously running tasks, rather than MapReduce jobs that run to completion. Storm also provides a higher-level query API called Trident [18], which can be described as the ESP counterpart of batch processing languages like Cascading [19], [20].
Although ESP systems are fundamentally different from batch-oriented systems, recent initiatives have tried to bridge both worlds. MapReduce Online [21] proposes a modified MapReduce architecture allowing data to be pipelined between operators, supporting online aggregations. More recently, and in active development at UC Berkeley, Apache Spark Streaming [22] is an interesting extension to Spark that adds support for continuous stream processing by introducing discretized streams, or micro-batches. Both tools bring the MapReduce programming model to the event processing domain.
ESP systems facilitate composing and deploying distributed event stream processing flows. To the best of our knowledge, however, the current state-of-the-art in ESP does not offer an SQL-like declarative API² that enables non-programmers to define streaming queries without writing a single line of code. Furthermore, ESP platforms currently do not support (automated) query plan optimizations reducing a query's overall bandwidth consumption, which are the core contributions of CHive.
2.3 Complex Event Processing
Complex Event Processing (CEP) solutions facilitate analyzing continuous event streams. Esper [23] is an open source CEP tool developed by EsperTech. It can operate as a standalone application or be embedded as a library in other Java applications. Esper provides a feature-rich, high-level SQL-like query language. In contrast to CHive, Esper has no out-of-the-box support for running continuous queries distributed over multiple JVMs. Scaling Esper over multiple JVMs can be done only through means provided by the underlying middleware, like Storm. In Storm, an Esper engine can be embedded in each bolt³ to process events received by that bolt. However, executing a high-level query that spans multiple bolts requires manually decomposing the query into smaller parts (one part per bolt) and later recomposing the partial results in a way that makes the final result semantically equivalent to the originally intended high-level query. Running multiple instances of the same bolt in parallel, e.g. to support scalability, similarly requires careful partitioning of the event streams and supporting recomposition at the end.
2. Note that Apache Spark has support for SQL and HiveQL, but this is not currently available for Spark Streaming.
3. A bolt represents a processing node in a Storm flow.
This is a difficult and error-prone task – precisely the task that CHive's query optimization engine performs in a fully automated way.
3 THE CHIVE QUERY LANGUAGE
CHive adds high-level query language support to distributed event stream processing. The CHive query language (CHiveQL) is strongly inspired by Esper's event processing language (EPL). EPL is an SQL-like language with SELECT, FROM, WHERE, GROUP BY, HAVING, ORDER BY and LIMIT clauses [23]. To support streaming analytics, EPL replaces database tables with event streams, which generate data tuples in a continuous fashion. Furthermore, EPL offers various SQL extensions supporting the expression of streaming queries, including statements to derive and aggregate information from one or more event streams, and to join or merge them [23]. As an example, EPL enables defining various types of window-based views on data streams, including time-based windows (e.g. to keep all events generated during the last 15 minutes) and fixed-length windows (e.g. to keep 10K events in memory), each in a sliding or tumbling version.
To illustrate a concrete EPL streaming query, the listing below depicts an EPL query expression calculating every second the top-10 HTTP hosts generating the highest download volumes, based on network measurements collected during the last 15 minutes. This query will be used in the remainder of this article to illustrate the presented work.

select http_host, sum(rec_bytes) as download_volume
from src.win:time(15 minutes)
where http_host != 'unknown'
group by http_host
output snapshot every 1 sec
order by download_volume desc
limit 10
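For readers unfamiliar with continuous queries, the sketch below shows how such an EPL statement would typically be registered with an embedded Esper engine. This is a minimal sketch against the legacy Esper 5.x client API (com.espertech.esper.client); the MobileFlowEvent class is our own stand-in for the measurement schema and is not part of CHive or Esper.

import com.espertech.esper.client.*;

public class Top10HostsExample {
    /** Simplified stand-in for the MobileFlow event schema (our assumption). */
    public static class MobileFlowEvent {
        private final String http_host;
        private final long rec_bytes;
        public MobileFlowEvent(String host, long bytes) { this.http_host = host; this.rec_bytes = bytes; }
        public String getHttp_host() { return http_host; }
        public long getRec_bytes() { return rec_bytes; }
    }

    public static void main(String[] args) {
        // Register the event type so the EPL can reference its attributes.
        Configuration config = new Configuration();
        config.addEventType("src", MobileFlowEvent.class);
        EPServiceProvider engine = EPServiceProviderManager.getDefaultProvider(config);

        EPStatement stmt = engine.getEPAdministrator().createEPL(
            "select http_host, sum(rec_bytes) as download_volume "
          + "from src.win:time(15 minutes) where http_host != 'unknown' "
          + "group by http_host output snapshot every 1 sec "
          + "order by download_volume desc limit 10");

        // The statement stays active; the listener fires on every 1-second snapshot.
        stmt.addListener((newEvents, oldEvents) -> {
            if (newEvents == null) return;
            for (EventBean row : newEvents) {
                System.out.println(row.get("http_host") + " -> " + row.get("download_volume"));
            }
        });

        engine.getEPRuntime().sendEvent(new MobileFlowEvent("example.com", 1024));
    }
}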
Providing a detailed overview of EPL or CHiveQL is beyond the scope of this article. Instead, we briefly introduce the supported query primitives and elaborate on the CHiveQL extensions supporting bandwidth-optimized query deployment.
3.1 Query Plan & Primitives
In order to execute a query, a query string is transformed into a query plan, representing a workflow through query primitives. Each query primitive implements a dedicated function of the query and is characterised by having one or more input gates (to receive incoming data tuples), as well as one or more output gates (for delivering processed data tuples to a connected component). CHiveQL currently supports the following familiar SQL-like query primitives:
• project – retains only those attributes of the input schema required for executing the query
• filter – retains only those records/events matching the filter criteria
• map – applies a function to query attribute values
• group – aggregates records/events into buckets
• order – sorts the result set
• limit – retains only a specified number of records/events in the result set
• join – combines two streams using a hash join function, preserving only those event combinations that meet one or more join condition(s)
• union – adds two sets together into a new set
In addition, CHiveQL adds a few primitives of its own:
• partition – distributes events over multiple instances of the query plan's downstream primitive according to the value of a hash function calculated on a specified attribute
• broadcast – duplicates incoming events to all output gates
• merge – the inverse of broadcast; aggregates events from each input gate by applying a given aggregation function
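As an illustration of this gate-based model, a primitive's contract might be captured as follows. This is a minimal Java sketch; the interface and class names are our own illustration, not CHive's actual API.

import java.util.List;
import java.util.function.Predicate;

/** A data tuple exposing its attributes by name (illustrative). */
interface Tuple { Object get(String attribute); }

/** Downstream connection; processed tuples are emitted here. */
interface OutputGate { void emit(Tuple tuple); }

interface QueryPrimitive {
    /** Invoked for every tuple arriving on the given input gate. */
    void onTuple(int inputGate, Tuple tuple);
    /** Expected output-to-input ratio, used by the plan compiler. */
    double outputToInputRatio();
}

/** Example: a filter retains only tuples matching its predicate. */
final class FilterPrimitive implements QueryPrimitive {
    private final Predicate<Tuple> predicate;
    private final List<OutputGate> outputGates;
    private final double passRatio; // e.g. taken from a @PASS_RATIO hint

    FilterPrimitive(Predicate<Tuple> predicate, List<OutputGate> outputGates, double passRatio) {
        this.predicate = predicate;
        this.outputGates = outputGates;
        this.passRatio = passRatio;
    }

    @Override
    public void onTuple(int inputGate, Tuple tuple) {
        // Forward matching tuples to every connected downstream component.
        if (predicate.test(tuple)) {
            outputGates.forEach(gate -> gate.emit(tuple));
        }
    }

    @Override
    public double outputToInputRatio() { return passRatio; }
}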
3.2 Query Annotations
CHiveQL includes a set of annotations that enable a data analyst and/or monitoring system to provide hints or context information, helping the CHive query compiler to calculate the optimized query plan.
3.2.1 Stream volume annotations
When searching for the most optimal query plan, the compiler calculates the end-to-end stream data volume that each candidate plan is expected to generate⁴. This requires knowledge about the expected output rate of each event source, and about the output-to-input ratio of each query primitive included in the plan – indicating how the size of the primitive's output stream relates to the size of its input stream. An output-to-input ratio between 0 and 1 represents a reduction in stream volume, while a value larger than 1 represents an increase. To deduce this information from a continuous query, CHiveQL offers various stream volume annotations. As an example, the listing below depicts an annotated CHiveQL query to calculate the top-N hosts generating the largest download volumes.

select http_host, sum(rec_bytes) as download_volume
from src.win:time(15 minutes) @EVENT_RATE=1000
where http_host != 'unknown' @PASS_RATIO=0.3
group by http_host @NUM_GROUPS=100
output snapshot every 1 sec
order by download_volume desc
limit 10
Using CHiveQL, every event source can be extended with an @EVENT_RATE annotation to express the source's expected output rate (in number of events per second). Filter operations, such as WHERE and HAVING clauses, can be annotated with @PASS_RATIO tags,
4. Query plans are internally represented as weighted directed acyclic graphs (WDAGs).
hinting at the fraction of events that are expected to pass the filter. Queries including aggregation window operations defined by GROUP BY clauses can include @NUM_GROUPS annotations specifying the anticipated number of groups that the affected windows will collect. Based on (a) this window's output rate (defined in the query; every 1 second for the top-N example), (b) the expected number of groups collected in the window and (c) the expected event arrival rate (deduced from upstream primitives in the graph), the CHive compiler can calculate the output-to-input ratio of each GROUP BY operation. Finally, a JOIN operation can be annotated with @JOIN_FACTOR, specifying the associated output-to-input ratio, whose value depends on the data distribution of the join keys and on the join conditions.
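To make this calculation concrete, consider the annotated top-N query above (a worked illustration under the stated hints; actual ratios depend on the observed data): the GROUP BY window receives $1000 \times 0.3 = 300$ events per second after the filter, and with @NUM_GROUPS=100 each one-second snapshot emits at most 100 group records, yielding an output-to-input ratio of approximately $\frac{100}{300} \approx 0.33$.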
3.2.2 Natural data partitioning
In addition to the stream volume hints, CHiveQL offers a dedicated annotation @PARTITIONED_ON to indicate that a stream is naturally partitioned on one or more GROUP BY keys. For instance, if we consider a system in which every stream produces events for a mutually exclusive set of HTTP hosts, then these streams are said to be naturally partitioned on the GROUP BY key http_host. With this knowledge, the inter-stream parallelism of a query plan can be improved by deploying separate aggregation windows for each stream, instead of a single one for all streams. This optimization can be applied to the top-N example, as long as all parallel results are later put together in one set, ordered again and limited to the N highest.

from src.win:time(15 minutes) @EVENT_RATE=1000, @PARTITIONED_ON=[http_host]
3.2.3 Explicit data partitioning
If streams are not naturally partitioned on a GROUP BY key, but values for this key overlap only occasionally when originating from different event streams, then it may still be beneficial to deploy separate aggregation windows as suggested above. If so, all event streams must be partitioned explicitly, delivering messages with the same key value to the same aggregation window. This can be accomplished using well-established range or hash partitioning techniques. To indicate the need for explicit stream partitioning based on one or more GROUP BY keys, CHiveQL includes a PARTITION ON clause. This clause is similar to Hive’s CLUSTER BY clause, which specifies the output columns that are hashed on, to distribute data to Hadoop reducers [6].
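A minimal sketch of such hash partitioning follows (our own illustration of the standard technique, not CHive's internal code): each event is routed to the aggregation window whose index matches the hash of its key, so events sharing a key always land in the same window.

/** Routes events to one of n parallel aggregation windows by key hash. */
final class HashPartitioner {
    private final int numPartitions;

    HashPartitioner(int numPartitions) { this.numPartitions = numPartitions; }

    /** Events sharing a key (e.g. http_host) always map to the same partition. */
    int partitionFor(String key) {
        // Mask the sign bit so the index is always non-negative.
        return (key.hashCode() & 0x7fffffff) % numPartitions;
    }
}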
4 SYSTEM ARCHITECTURE
CHive is a layer that sits on top of existing Event Stream Processing (ESP) platforms, like Storm and Akka. As such, CHive inherits the typical
non-functionals (support for high-availability, reliability, elasticity) from the underlying ESP platform.⁵
Fig. 2. CHive System Architecture.
The remainder of this section highlights the major components of the CHive architecture, as well as the overall process flow and algorithms for query plan generation and deployment. Figure 2 depicts the CHive architecture, including the following main components:
Client APIs. CHive offers various client APIs to submit CHive query expressions, network topology information, event source information (i.e. their address, name and event type name) and the schemas of event types. These APIs currently include a command line interface (CLI), a web interface (HTTP) and a Java API.
Query Plan Compiler. This component generates a distributed CHive query plan to be deployed onto a given or discovered network topology. Section 4.1 explains how this plan is optimized in terms of bandwidth consumption between all data sources and the final sink.
Query Execution Library. This library offers all required functionalities to execute CHive query plans, including a generic implementation of all supported query primitives. Additionally, the query execution library contains a high-performance and low-memory-footprint windowing implementation. More details can be found in Section 4.2.
Query Deployment Engine. Deploys the generated optimized CHive query plan onto the processing nodes in the network topology, as explained in Section 4.3.
4.1 Query Plan Compiler
Figure 3 illustrates the overall CHive query plan generation process flow. To generate a CHive query plan, the query plan compiler requires a valid CHiveQL query expression as input, as well as meta-information describing the event streams being used. The Query Parser/Lexer parses this CHiveQL query, verifies its syntactic and semantic correctness, and transforms this string into an abstract syntax tree (AST).
5. Note that this assumes that the underlying ESP platform can be over-layed on top of a distributed cloud, i.e. including nodes of different data centers in a single cluster. If one needs to resort to multiple instances of the ESP platform, one per data center, then the end-to-end reliability, availability and elasticity aspects need to be reworked. This is however outside the scope of this document.
Fig. 3. Process flow of the CHive query plan compiler.
The Reference Query Plan Generator uses this AST to generate a Reference Query Plan (RQP), representing a query execution graph optimized for deployment on a centralized cluster or data center. This RQP minimizes the amount of information that needs to flow through each subsequent primitive. Conceptually, this stage is similar to the query plan compilation steps of traditional databases and data warehousing solutions like Hive [6]. Note that this plan will never be executed as is, since it still needs to be optimized for the network topology on which it is to be deployed.
To calculate a bandwidth-optimized query plan for distributed deployment, the query plan compiler also needs a description of the target network graph. The Network Topology Discovery component is in charge of discovering – and providing access to – the network topology onto which a query is to be deployed. The network topology can either be provided by the user as a JSON-formatted document listing the network nodes and the edges interconnecting them, or it can be discovered automatically using ALTO [24], [25] servers. The Minimal Steiner Tree Calculator then calculates a minimal Steiner tree from this network topology graph, including all data source nodes and the query sink node as terminal vertices. This Steiner tree represents the cheapest interconnect for these nodes, and will be used as the deployment tree (DT) onto which the various primitives of the query plan will be placed.
In the next stage, the Optimal Distributed Query Plan Generator uses the RQP and the DT as input to calculate an Optimized Query Plan (OQP) tailored for distributed deployment. The OQP Generator maps the various primitives of the RQP onto the vertices of the DT, such that the overall bandwidth consumption is minimal. This includes, for instance, deploying various primitives of the RQP as close to each other and as close to the source nodes as possible. The ability to do so highly depends on the RQP's degree of inter-stream parallelism, which we define as follows:
A query plan’s inter-stream parallelism represents the ability to execute individual primitives of a query plan in parallel on separate event streams without compromising the semantic correctness of the query plan – that is, yielding the same results as when executing these primitives on the union of all involved event streams. To improve the overall bandwidth consumption of the distributed query plan, the OQP Generator includes a set of substitution rules for replacing query primitives with semantic equivalents that increase the degree of interstream parallelism (see Section 4.1.1). Hence, the OQP Generator rewrites the RQP based on the actual DT. In the final stage, the compiled OQP is handed over to the Query Deployment Engine, which deploys, connects and activates the runtime equivalents of the plan primitives on the affected nodes. Note that the actual query plan compilation has been fully decoupled from the execution engine. This enables to execute the same deployment plan on various runtimes, including for example Storm, Akka [26] and Spark Streaming. 4.1.1 Substitution Rules Most query primitives - including project, filter, map, union, broadcast, merge and partition - can operate in parallel on multiple (sub)streams without compromising the semantic correctness of a query plan. Other primitives, like group, order, limit and join, are not streamparallelizable by default. CHive’s Query Plan Compiler therefore includes a set of substitution rules to replace these query primitives with semantic equivalents that improve the inter-stream parallelism of a query plan. The remainder of this section highlights these primitives and presents the associated substitution rules. Group-By Aggregation. This primitive aggregates events according to the values of one or more groupby attributes. Events that encapsulate identical values for these group-by attributes are collected in the same bucket. The primitive periodically executes one or more aggregation functions on the attributes of these grouped events. The latter can be triggered by an output rate limiter specified in the query (e.g. as a ”output every x seconds” clause), or when the size of the aggregation window exceeds a specified threshold. Currently, CHive implements aggregation functions sum, count, min, max and avg. The output event has cardinality set and contains one event record for each aggregated group, having the aggregated values as separate attributes. For partitioned streams, group-by can operate independently on each stream, provided that all intermediate results are collected in a final set later on. Unpartitioned streams require a two-pass aggregation to enable interstream parallelism. ⇣ S ⌘ h ⇣S ⌘i n n 0 group f, j=1 Sj = groupG f 00 , merge j=1 groupS (f , Sj )
Per stream, a group_S(f', S_j) operation groups incoming events and executes a local aggregation function f' on the locally available data. This local aggregation function may be slightly different from the original function, as it needs to safeguard all context information that is required to correctly execute the final global aggregation step. The associated global aggregation function f'' processes all intermediate results generated by f', such that the result of this operation is identical to the execution of f on the global data set. To illustrate this, let $S_i$ symbolize a particular data source, and $V_i$ the grouped values originating from that stream. To calculate the global average of $V$, f' calculates $\mathrm{sum}(V_i)$ and $\mathrm{count}(V_i)$ for each stream $S_i$, while f'' deduces the overall average as

$$\frac{\mathrm{sum}(\mathrm{sum}(V_i), \forall S_i)}{\mathrm{sum}(\mathrm{count}(V_i), \forall S_i)}$$

Group-All Aggregation. For queries containing aggregation functions without a GROUP BY clause, the query compiler integrates a group-all primitive in the query plan. This primitive is similar to group-by, in that it periodically executes one or more aggregation functions on the attributes of grouped events. The difference is that group-all collects all events in a single group, meaning that a singleton event will be produced by applying the aggregation functions on all events in the associated window. Since group-all cannot benefit from stream partitioning, its substitution rule equals the two-pass substitution rule for group-by on unpartitioned streams.
Order. The order primitive takes a set of events as input and produces a new set including all records sorted according to the specified criteria. The order primitive cannot operate in parallel on multiple streams without compromising the semantic correctness of the query plan – as ordering streams individually does not yield a global ordering of all event streams. This can be resolved by applying the following two-pass ordering substitution, inspired by the merge-sort algorithm:

$$\mathrm{order}\Big(\bigcup_{j=1}^{n} S_j\Big) = \mathrm{order}_G\Big(\mathrm{merge}\Big[\bigcup_{j=1}^{n} \mathrm{order}_S(S_j)\Big]\Big)$$

Per stream, order_S orders data events that belong to the same data window in $O(n \log n)$. Next, order_G unites all these individually ordered sets into a globally ordered set in $O(n)$. This global ordering step is executed after merging the intermediate results using the CHive merge primitive – which aggregates all related data events together, as we describe below. In contrast to order_S, order_G cannot operate in parallel on multiple (sub)streams. Due to the lower computational complexity of order_G, however, replacing a single order primitive with this two-pass ordering substitution improves the overall scalability of a query plan.
Limit. This range primitive receives a set of events and produces a new output set, limiting the number of event records in the output set to a given row count. The limit primitive cannot operate in parallel on multiple streams, unless these streams are partitioned on one or more group-by keys – explicitly by using the PARTITION ON clause or implicitly via the @PARTITIONED_ON annotation. If so, inter-stream parallelism can be achieved by
applying the following two-pass limit substitution:

$$\mathrm{limit}\Big(\bigcup_{j=1}^{n} S_j\Big) = \mathrm{limit}\Big(\mathrm{merge}\Big[\bigcup_{j=1}^{n} \mathrm{limit}(S_j)\Big]\Big)$$
For each stream, a separate limit operation reduces the size of the aggregated data events. The final limit primitive, in turn, reduces the union of these partitioned data sets. This final limit is executed after the CHive merge primitive unites all intermediate results.
Join. This primitive joins two streams together using a hash join function, preserving only those event combinations that meet one or more join condition(s). A join primitive has a window associated with each stream it joins. When a new event arrives on either one of the input streams, the join primitive tries to match the event with each event in the window associated with the other stream, according to the given join condition(s). An output event is produced for each match. Currently, only INNER joins are supported. Although join operations are known to be parallelizable using MapReduce [10] and parallel database techniques [27], partitioning join for stream processing turns out to be very restrictive. The current version of CHive therefore does not include query substitution rules to improve the stream-parallelism of join primitives.
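To make the f'/f'' decomposition for avg concrete, it can be sketched as follows. This is a minimal Java sketch; the class and method names are our own illustration, not CHive's actual code.

/** Local pass f': a per-stream partial aggregate that safeguards the
 *  context (sum and count) needed for an exact global average. */
final class PartialAvg {
    double sum;
    long count;

    void add(double value) { sum += value; count++; }
}

/** Global pass f'': merges the per-stream partials produced by f' into
 *  the exact average over the union of all streams. */
final class GlobalAvg {
    static double merge(Iterable<PartialAvg> partials) {
        double totalSum = 0;
        long totalCount = 0;
        for (PartialAvg p : partials) {
            totalSum += p.sum;      // sum(sum(Vi), for all Si)
            totalCount += p.count;  // sum(count(Vi), for all Si)
        }
        return totalSum / totalCount;
    }
}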
4.1.2 Optimized Query Plan Generation Algorithm
The query plan optimization algorithm for deployment onto a given network tree is illustrated in Figure 4. The algorithm is based on the following premises:
1) The reference query plan, by definition, contains the right sequence of primitives for bandwidth optimization purposes.
2) Query primitives that are not stream-parallelizable need to be allocated on the trunk, defined as the tail end of the deployment tree that only has a single path towards the sink. Note that these are primitives that are not stream-parallelizable even in the presence of substitution rules.
3) Query primitives that have an output-to-input ratio less than 1 need to be placed as close as possible to the relevant event source. Since overall bandwidth consumption is defined as the sum of bandwidth consumption on each deployment tree edge, reducing bandwidth early on a path from source to sink translates into cumulative gains for that path.
4) Similarly, query primitives that have an output-to-input ratio larger than 1 need to be placed as close as possible to the sink.
5) Since 3) and 4) act as opposing forces, the optimization algorithm must operate on sub-sequences of query primitives rather than on individual ones to determine the most beneficial output-to-input ratio.
Fig. 4. Flow diagram of the distributed query plan optimization algorithm.
As illustrated in Figure 4, CHive's query plan optimization algorithm takes the reference query plan (RQP) and the deployment tree (DT) as input. First, it determines which primitives in RQP are not stream-parallelizable. Since these components cannot be executed on different branches of DT in parallel, they need to be placed on the tree trunk (as per premise 2). Next, DT is traversed starting at each source node down to the sink⁶. For every stream arriving at the selected source node, the algorithm collects the primitives this stream passes through in RQP. Those primitives are called reachable primitives. We call allocatable primitives the subset of reachable primitives that can be allocated on a specific node in DT. A join primitive, for instance, can only run on a node through which all required input streams flow. Out of these allocatable primitives, the most reducing sequence is determined (premise 5). This is the sequence of primitives for which the output-to-input ratio is minimal. If the ratio is less than 1, this sequence of primitives must be deployed as close as possible to the event source(s) (as per premise 3). As long as there are reachable primitives left that haven't been allocated yet, the next node in DT is selected to become the current node and the same procedure is repeated. When the above has been done for each source node in DT, all that is left to be allocated must be allocated on the tree trunk. Again, parts that reduce the event stream the most are allocated as soon as possible (premise 3) on this trunk, while the rest is deployed as late as possible (premise 4).
6. A source node is actually a logical node in the deployment tree identifying a point where an event stream arrives, without having processing capabilities associated with it.
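For intuition, the placement strategy of premises 3–5 can be illustrated with a drastically simplified, runnable sketch. This is our own toy reduction to a single linear source-to-sink path; the actual algorithm operates on trees, handles allocatability constraints and inserts substitution primitives.

import java.util.*;

public class GreedyPlacement {
    record Primitive(String name, double outputToInputRatio) {}

    /** Returns, per path node, the primitives allocated to it. */
    static Map<String, List<Primitive>> place(List<Primitive> chain, List<String> path) {
        Map<String, List<Primitive>> allocation = new LinkedHashMap<>();
        for (String node : path) allocation.put(node, new ArrayList<>());

        int next = 0; // next primitive in the chain still to allocate
        for (String node : path) {
            // Take the prefix with the minimal cumulative output-to-input
            // ratio below 1 (premise 3: reduce volume close to the source).
            double ratio = 1.0, bestRatio = 1.0;
            int best = next;
            for (int i = next; i < chain.size(); i++) {
                ratio *= chain.get(i).outputToInputRatio();
                if (ratio < bestRatio) { bestRatio = ratio; best = i + 1; }
            }
            allocation.get(node).addAll(chain.subList(next, best));
            next = best;
        }
        // Leftovers (cumulative ratio >= 1, e.g. expanding primitives) go on
        // the last node before the sink (premise 4: place increases late).
        allocation.get(path.get(path.size() - 1)).addAll(chain.subList(next, chain.size()));
        return allocation;
    }

    public static void main(String[] args) {
        List<Primitive> chain = List.of(
            new Primitive("project", 0.5), new Primitive("filter", 0.6),
            new Primitive("groupBy", 1.5), new Primitive("orderBy", 1.0));
        // Prints: project and filter at DC1.1.1; groupBy and orderBy at DC1.
        System.out.println(place(chain, List.of("DC1.1.1", "DC1.1", "DC1")));
    }
}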
The previous step has allocated all primitives from RQP on DT nodes. The next step is to determine, based on the substitution rules of Section 4.1.1, which (missing) substitution primitives need to be allocated on downstream nodes to assure semantic equivalence with the original RQP. These additional substitution primitives support the allocation of query primitives on parallel branches of DT. Only non-leaf nodes and non-trunk nodes (except the first one) are taken into account to allocate substitution primitives. Finally, all primitives need to be wired up. Wiring is the process of defining the edges in the graph of the overall optimized query plan (OQP). This step involves determining the correct primitive sequence and the schemas of each primitive's input and output gate⁷.
4.1.3 Examples
In this section we depict two distributed query plans generated by CHive for the top-N query presented in Section 3, illustrating the impact of CHiveQL hints on the final query plan. The target hierarchical deployment network topology is visualized in Figure 5.
Example 1 – Unpartitioned Streams. In the absence of CHiveQL hints, CHive assumes it is dealing with unpartitioned streams. It also assumes (almost⁸) worst-case output-to-input ratios for all primitives requiring hints to accurately calculate this ratio. The resulting OQP is shown in Figure 6. The project primitive's output-to-input ratio can be calculated based on the provided input and output schemas, and is determined to decrease the stream volume. The filter primitive, at worst, can have an output-to-input ratio of 1.
7. Input and output gates are strongly typed first-class communication citizens responsible for receiving incoming data tuples and delivering processed data tuples to a connected component, respectively.
8. "Almost" because a filter primitive is assumed to always reduce event streams. This is true as soon as a single event can be filtered out.
Fig. 5. Network Deployment Tree. Each MFSi represents a separate stream of MobileFlow events.
Fig. 6. OQP for Top-N when no hints are provided.
Therefore, both primitives must be executed as close as possible to the event sources, as they reduce the event streams and are fully stream-parallelizable. Also the group-by primitive is stream-parallelizable, provided that the relevant substitution rule is invoked later (see Section 4.1.1). Since the lack of query hints yields a worst-case scenario, the group-by primitive is assumed to increase the stream volume. Furthermore, the order-by and limit primitives are only stream-parallelizable on partitioned streams, so they need to be placed on the tree trunk. This also means that no sub-sequence of query primitives containing the group-by primitive can be formed that could potentially have an output-to-input ratio less than 1. The CHive query plan optimization algorithm therefore places the group-by primitive as late as possible. Both the intermediate nodes (DC 1.1 and DC 1.2) and the SINK node (DC 1) need to accommodate a union component to route events from multiple input streams onto a single output stream. The union component on DC 1.2 basically performs a simple pass-through function, since it only has a single input stream.
Example 2 – Partitioned Streams. Figure 7 shows the generated OQP for distributed deployment of the top-N query when the @PARTITIONED_ON hint is included. This hint informs the query plan optimizer that the group-by/order-by/limit primitive sequence can execute in parallel, since it can operate in total isolation from the other streams. Because the limit primitive reduces the stream the most, the entire sub-sequence is placed as close as possible to the respective sources. To support parallel execution of group-by, order-by and limit, additional substitution primitives need to be deployed on intermediate nodes and the final node of DT. The substitution rule for group-by on partitioned streams says that it must be followed by a union primitive on subsequent nodes. Since group-by is followed by order-by and limit on the same node in the OQP, the union of DC 1.1 produces two sets of ordered events.
Fig. 7. OQP for Top-N for implicitly partitioned streams.
Next, the substitution rule for order-by on DC 1.1 performs the total ordering of those two sets, and the substitution rule for limit retains only the first half. Finally, the same procedure needs to be repeated at DC 1.
4.2 Query Execution Library
After creating and deploying an optimized query plan, the latter needs to be executed. CHive's Query Execution Library offers the required functionality to execute CHive query plans, including aggregation window functionality and a generic implementation of all query primitives.
4.2.1 Windowing Support
In continuous querying, aggregation windows are the equivalent of database tables in the non-streaming world. They are used to build up limited history about event streams, so that calculations can be done on collections of events rather than individual ones. CHive supports time-based and fixed-length aggregation windows. In this section we focus on sliding time windows.
The Windowing Support module is the most important component of the CHive execution library. Since most events will end up in aggregation windows, the overall scalability of query plans highly depends on the overhead this component adds per event. The CHive implementation therefore provides an optimized implementation, highly inspired by the LMAX Disruptor design [28]. Key to this design is the use of a pre-allocated, non-locking ring buffer structure, as visualized in Figure 8. Since aggregation windows typically consume a lot of memory, it makes sense to facilitate sharing⁹ common
9. In our current prototype, multiple time windows – each supporting a different window size – can already be supported by a single RingBuffer structure; however, the mechanism to share runtime components hosting a time window across query plans is future work.
Fig. 8. Design of the Sliding Time Window Component.
aggregation windows among similar query plans. This enables a substantial reduction of the overall memory footprint of an analytics application composed of multiple (similar) continuous queries. In Figure 8, we show three windows (W1, W2 and W3) in a single RingBuffer structure. In this structure, there is a single Write Barrier and one Read Barrier per superposed window. The Write Barrier indicates the position in the RingBuffer where the next arriving event will be stored, while each Read Barrier references the position of the last (oldest) event in the corresponding window that has not yet expired.
In the sliding time-window implementation, all events are stored in the RingBuffer structure with a timestamp. Depending on the implementation or query needs, this timestamp could be the time at which the event was inserted in the RingBuffer or an embedded timestamp of the event itself. Whenever a new event arrives, all Read Barriers are first instructed to advance clockwise until they reach an event that has not yet expired (i.e. an event with a timestamp that is larger than or equal to the current time minus the size of the window in milliseconds). Next, the new event is stored at the position in the RingBuffer referenced by the Write Barrier, and finally the Write Barrier moves one slot clockwise. The number of events in a particular time window equals the number of slots in the RingBuffer structure between the Write Barrier and that window's Read Barrier.
As a boundary condition, the Write Barrier can never move clockwise past any of the Read Barriers. When this condition occurs, the Write Barrier must wait until the last Read Barrier moves forward – the RingBuffer is said to be full. Since this causes a blocking condition and hence slows down event processing, it is important to correctly estimate the required size of the RingBuffer structure. This size can be calculated as $S = I_p \cdot T \cdot M$, where $S$ is the size of the ring buffer in number of slots, $I_p$ is the expected event input rate in number of events per second at primitive $p$ hosting the time window, $T$ is the size of the time window in seconds and $M$ is a safety margin, say 10%, to accommodate fluctuations in event arrival rates. (For instance, at $I_p = 1000$ events/s, a 15-minute window $T = 900$ s and $M = 1.1$, this yields $S = 990{,}000$ slots.) $I_p$ can further be expressed as

$$I_p = \sum_{e \in E} \Big( \prod_{p' \in P_{e,p}} r_{p'} \Big) I_e$$

where $E$ is the set of event sources (indirectly) feeding into primitive $p$, $P_{e,p}$ is the set of primitives in the query plan path from source $e$ to primitive $p$, $r_{p'}$ is the output-to-input ratio of primitive $p'$, and $I_e$ is the expected event arrival rate of source $e$.
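The barrier mechanics can be sketched as follows. This is a minimal, single-threaded Java sketch of the design just described, with one window and one read barrier; the lock-free, multi-window machinery of the actual Disruptor-style implementation is omitted, and all names are our own.

/** Pre-allocated sliding time window over a ring buffer (simplified). */
final class SlidingTimeWindow {
    private final long[] timestamps;   // one pre-allocated slot per event
    private final long windowMillis;
    private long write;                // write barrier: next slot to fill
    private long read;                 // read barrier: oldest unexpired event

    SlidingTimeWindow(int capacity, long windowMillis) {
        this.timestamps = new long[capacity];
        this.windowMillis = windowMillis;
    }

    /** Stores a new event; in the real design, event attributes are updated
     *  in place inside pre-allocated objects to avoid creating garbage. */
    void add(long eventTimeMillis, long now) {
        // First advance the read barrier past all expired events.
        while (read < write
                && timestamps[(int) (read % timestamps.length)] < now - windowMillis) {
            read++;
        }
        // Boundary condition: the write barrier may not pass the read barrier.
        // The real design blocks here until the read barrier moves forward.
        if (write - read == timestamps.length) {
            throw new IllegalStateException("ring buffer full; size it via S = Ip * T * M");
        }
        timestamps[(int) (write % timestamps.length)] = eventTimeMillis;
        write++;
    }

    /** Live (unexpired) events = slots between the two barriers. */
    long size() { return write - read; }
}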
The maximum size of the RingBuffer is limited by the amount of available memory in the processing node hosting it. In order to minimize Java Garbage Collection (GC), the RingBuffer is pre-allocated at startup with empty instances of the event type associated with the window. Storing new events in the RingBuffer involves updating the event attributes with new values, rather than overwriting the previous event object with a new object. This technique avoids creating garbage in the case of attributes of primitive types. Experiments described in Section 5.2 indicate that, in comparison to Esper [23], CHive reaches substantially higher event throughputs by limiting GC cycles and avoiding read-write contention through the use of a ring buffer structure, in addition to being able to handle larger time windows as a result of its smaller memory footprint.
4.2.2 Query Primitive Library
CHive's Query Primitive Library contains an implementation of all supported query primitives listed in Section 3. Note that the implementation is independent of the underlying ESP platform. Dedicated primitive wrappers are used to adapt this library to the specifics of the chosen ESP platform – such as Storm.
4.3 Query Deployment Engine
Finally, the Query Deployment Engine is responsible for deploying the OQP onto the processing nodes of the minimal Steiner tree network. To enable executing a CHive query plan, the Query Deployment Engine provides an ESP-specific implementation of all supported query primitives. To operate on top of Storm [3], the query primitive functions have been wrapped into Storm bolts. In addition, this Storm-specific implementation includes a Topology Builder component that transforms the OQP into individual Storm topologies¹⁰ and a custom scheduler to assign tasks across the workers in the cluster, taking the hierarchical network topology into account.
5 EVALUATION
This section evaluates the claimed benefits of CHive, both in terms of bandwidth savings and raw execution speed. We first present a theoretical evaluation of the top-N query presented in Section 3, taking the Alcatel-Lucent WNG (Wireless Network Guardian) product as a use case. Next, we perform a practical evaluation using a stream of CDR data obtained from a large mobile operator. Finally, we also compare the performance of the CHive query execution library against Esper, a popular open source CEP engine.
10. In a distributed cloud environment, each node in the logical Network Topology can actually represent a micro data center and as such be a cluster of processing nodes.
5.1 Theoretical Evaluation: Top-N Use Case
This evaluation assumes that probes are strategically placed throughout the network (access layer) to gather information on the number of bytes downloaded from each website whenever users of the network are accessing the Internet. Each probe generates an event stream, called MobileFlow Stream (MFS), at an average rate of $x$ events per second. We also assume processing nodes to be available at the various layers of the network, i.e. the access layer, the edge layer and the network core. This setup is depicted in Figure 5.
In what follows, we calculate when it becomes more efficient to process event streams in a distributed fashion rather than in a single data center, taking bandwidth consumption as the efficiency criterion. In the case of distributed execution, multiple strategies can be chosen, depending on whether the streams are partitioned on the group-by field or not. We will consider three distributed execution cases: naturally partitioned, unpartitioned and explicitly partitioned streams. Given the fact that we are considering a telco network¹¹ here, and that such a network is built up in a hierarchical way, we can express overall bandwidth consumption as $B = B_{edge} + B_{core}$ – in other words, the sum of the bandwidth consumption in the edge layer (i.e. between each access-layer DC and its downstream edge DC) and the core layer (i.e. between each edge-layer DC and the central data center in the network core). In what follows, we express bandwidth as a number of events/second, hence canceling out the actual event size.
5.1.1 Centralized
In this case, all event streams are sent to the central DC, where the top-N algorithm runs. The amount of information sent across the network towards this central DC can easily be calculated as $B_{edge} = B_{core} = ax$, so that $B_{central} = 2ax$, where $a$ is the number of MobileFlow event streams in the network and $x$ is the rate at which events are being produced by each probe.
5.1.2 Distributed – Naturally Partitioned
If the group-by field values are mutually exclusive per stream (such as for grouping by location), each access-layer DC could simply calculate its top-N locally and only report these top-N results to the edge layer at each output moment (frequency $y$). The edge-layer merging would be very simple: i.e. picking the top $n$ values out of the $a_i \cdot n$ values being reported to it, with $a_i$ being the number of access-layer DCs connected to this edge DC ($\sum_i a_i = a$). It follows that in this case, bandwidth consumption can be expressed as $B_{edge} = any$ and $B_{core} = eny$, so that $B_{natural} = any + eny = (a + e)ny$, where $a$ is the number of MobileFlow event streams
11. Although we take the specific setting of a hierarchical telco network to illustrate the value, the concept is equally applicable to any other type of network – including a full mesh – since part of the optimization strategy is to first reduce the network topology to a tree structure, by calculating a minimal Steiner tree onto that topology.
This article has been accepted for publication in a future issue of this journal, but has not been fully edited. Content may change prior to final publication. Citation information: DOI 10.1109/TCC.2015.2424868, IEEE Transactions on Cloud Computing 11
Distributed execution in the case of naturally partitioned streams therefore becomes interesting as soon as

$$B_{natural} < B_{central} \;\Rightarrow\; \frac{a+e}{2a}\,ny < x$$
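To make this break-even concrete, the short Python sketch below (ours; parameter names follow the text and the values anticipate the example that follows) evaluates the inequality:

import math

def natural_break_even_rate(a, e, n, y):
    # B_natural < B_central  <=>  (a + e) / (2a) * n * y < x, so return
    # the smallest integer rate x satisfying the strict inequality.
    return math.floor((a + e) * n * y / (2 * a)) + 1

# 100M users, one access-layer DC per 100K users (a = 1000), one edge-layer
# DC per 50 access-layer DCs (e = 20), top-10 (n = 10) output once per second.
print(natural_break_even_rate(a=1000, e=20, n=10, y=1))  # -> 6 events/s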
As an example, let us consider a network serving 100M users, one access-layer DC per 100K users and one edge-layer DC for every 50 access-layer DCs. This translates into a = 1000 and e = 20. Top-10 calculation then gains from distributed execution as soon as the event arrival rate x ≥ 6 events/s.

5.1.3 Distributed – Unpartitioned

In general, however, we do not have a mutually exclusive set of group-by field values per stream – as in the specific use case of grouping by HTTP host. In this case, we need to report the entire set of "download volumes by host" to the next layer at each output moment. The merging algorithm at the next layer first needs to sum the download volumes per host and then calculate the new n highest values. We can express bandwidth consumption for this (unpartitioned) use case as $B_{edge} = aT_m y$ and $B_{core} = eT_m y$, so that

$$B_{unpartitioned} = aT_m y + eT_m y = (a+e)T_m y \qquad (1)$$

where a, e and y are as defined before. Additionally, $T_m$ represents the number of aggregation slots occupied in a time window when m events have arrived since the window was opened. To calculate $T_m$, let us first call $P_m^{dup}$ the probability that the m-th event arriving in the window w has a duplicate value for the aggregation key – i.e. this event can successfully be aggregated with another event already present in the window, and hence will not consume an additional aggregation slot. $\forall m \geq 1$, this probability can be expressed as $P_m^{dup} = s\,T_{m-1}$, where s is the selectivity factor of the group-by field in the stream, defined as the inverse of the number of possible distinct values for the aggregation key(s) in the event stream. It follows that

$$T_m = T_{m-1} + (1 - P_m^{dup}) = 1 + (1-s)\,T_{m-1} \;\Rightarrow\; T_m = \sum_{i=0}^{m-1}(1-s)^i = \frac{1-(1-s)^m}{s} \qquad (2)$$

As a bounding condition, we have $T_0 = 0$. Note that (2) is also known as the Birthday problem, and that $T_m \leq \min(m, 1/s)$. In the case of a sliding time window, $m = xw$, with x representing the event arrival rate and w the window size in seconds (note 12). As a result, our equation becomes

$$B_{unpartitioned} = (a+e)\,\frac{1-(1-s)^{xw}}{s}\,y$$

In the general case, distributed execution becomes interesting when

$$B_{unpartitioned} < B_{central} \;\Rightarrow\; \frac{(a+e)\left(1-(1-s)^{xw}\right)}{2as}\,y < x$$

which, for windows large enough that $(1-s)^{xw} \approx 0$, simplifies to $\frac{a+e}{2as}\,y < x$.
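This simplification assumes a saturated window. A minimal sketch of Eq. (2) in Python (ours) shows how quickly saturation sets in:

def occupied_slots(m, s):
    # Closed form of Eq. (2): T_m = (1 - (1 - s)**m) / s, with T_0 = 0.
    # s is the selectivity factor, i.e. 1 / #distinct aggregation-key values.
    return (1.0 - (1.0 - s) ** m) / s

# With s = 1/1000 and a 10-second sliding window at x = 510 events/s
# (m = x * w = 5100), the window is already close to its 1/s = 1000 ceiling:
print(occupied_slots(5100, 1 / 1000))  # ~994 of at most 1000 slots occupied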
Using the same example as above for the mutually exclusive distributed case, and a selectivity factor s = 1/1000, unpartitioned distributed event processing is beneficial as soon as the event arrival rate x ≥ 510 events/s.

5.1.4 Distributed – Explicitly Partitioned
Since partitioned streams can be processed more efficiently in a distributed setting than unpartitioned streams, CHive offers the PARTITION ON clause to explicitly partition the streams before processing them. This strategy can be beneficial for use cases in which only a marginal number of events arrive at the "wrong" DC, such as partitioning on user ID in a roaming scenario. According to the hierarchical network topology depicted in Figure 5, however, we cannot assume a direct communication link to exist between all access-layer DCs. Instead, in order to reach another access-layer DC, communication needs to follow the physical path DC1.1.2 → DC1.1 → DC1.1.1. In the worst case (note 13), if an event needs to be sent to an access-layer DC in another edge domain, communication may need to follow the path DC1.2.1 → DC1.2 → DC1 → DC1.1 → DC1.1.1. Hence, the overhead of explicit partitioning can quickly outweigh the potential benefits. To calculate the bandwidth consumption for this case, we introduce a new variable f representing the fraction of events that arrive at the right DC, i.e. the DC responsible for processing all events sharing the same key value. In the case of an equal distribution of key values across all data centers, f would be equal to 1/a; the mutually exclusive case corresponds to f = 1. It is expected that f should be much higher than 1/a for this strategy to be beneficial. Given the above definition of f, the total number of events arriving at the right access-layer DC is given by $afx$, while the total number of outliers is $a(1-f)x$.
The extra bandwidth consumption in the edge layer caused by explicit partitioning therefore equals

$$B_{edge}^{extra} = 2a(1-f)x$$

which covers communication from the originating DC to the edge DC and from the edge DC to the responsible DC. Let us assume a hierarchical network with a perfectly balanced tree structure having a access-layer DCs and e edge-layer DCs. The extra bandwidth consumption in the network core caused by explicit partitioning corresponds to the total number of outliers arriving at an access-layer DC that is not connected to the same edge-layer DC as the access-layer DC responsible for processing the outlier:

$$B_{core}^{extra} = 2(e-1)\,\frac{a}{e}\,(1-f)x$$

i.e. downstream from edge to core plus upstream from core to the responsible edge. The total bandwidth consumption for explicit partitioning, when all processing happens exclusively at the access-layer DCs, therefore equals

$$B_{explicit} = B_{natural} + B_{edge}^{extra} + B_{core}^{extra} = (a+e)ny + 2a(1-f)\left(2 - \frac{1}{e}\right)x$$

12. Only at the very start of the stream would the window be empty. We ignore these startup effects in our calculations, as we are only interested in sustained bandwidth measurements.

13. In typical telecommunication networks, edge-layer nodes are interconnected in a ring structure. We assume this not to be the case here, however, since we are looking to obtain worst-case values.
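Putting the formulas together, the following Python sketch (ours; it approximates $T_m$ by its ceiling 1/s) compares the two strategies and recovers the break-even point quoted next:

def b_unpartitioned(a, e, s, y):
    # Saturated-window approximation: T_m ~ 1/s.
    return (a + e) * (1 / s) * y

def b_explicit(a, e, n, y, f, x):
    return (a + e) * n * y + 2 * a * (1 - f) * (2 - 1 / e) * x

a, e, n, y, s = 1000, 20, 10, 1, 1 / 1000
f = 1 / a  # equal key distribution: nearly every event is an outlier

x = 1
while b_explicit(a, e, n, y, f, x) < b_unpartitioned(a, e, s, y):
    x += 1
print(x - 1)  # -> 259: explicit partitioning wins up to x ~ 259 events/s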
Taking the same example as before, with a = 1000, e = 20, y = 1, n = 10, s = 1/1000 and f = 1/a (equal distribution), we can deduce that explicit partitioning is beneficial compared to unpartitioned processing as long as x < 259 events/s. Figure 9 shows for which values of 1 − f, as a function of the per-DC event input rate x, this strategy becomes beneficial. On the left, we see that more outliers can be supported as the selectivity factor is lowered, i.e. as the key value space becomes larger. On the right, we see the influence of the value of n on 1 − f. More specifically, the left part of the figure shows that for a selectivity factor s = 1/1000, explicit partitioning is always beneficial for an event input rate x < 259; for s = 1/2000, this becomes x < 521; for s = 1/10000, we get x < 2615. Similarly, the right part of the figure shows that for n = 10, explicit partitioning is always beneficial for an event input rate x < 258; for n = 100, this becomes x < 235; for n = 500, we get x < 130.

5.2 Experimental Evaluation
Two aspects of the CHive system are evaluated: the bandwidth reduction capabilities of distributed execution versus a centralized approach, and the raw performance of CHive's query execution library compared to Esper, a popular open-source CEP tool.

5.2.1 Bandwidth Reduction

The maximum achievable bandwidth reduction largely depends on the query at hand, as well as on the characteristics of the event streams and the number of layers in the deployment hierarchy. Figure 10 shows the bandwidth reduction for various configurations of a top-N query similar to the one shown in Section 3. We used the 3-layer network topology of Figure 5, representative of most telco networks. The experiments process the CDR log of a large mobile operator, covering an entire day (note 14). The figure shows bandwidth reductions of up to 99.9% for partitioned streams (calculating the top-10 Mobile Base Station (MBS) IDs initiating the most on-call minutes) and between 84% and 95% for unpartitioned streams (calculating the top-10 call destinations). For unpartitioned streams, most gains are realized by the projection step, which reduces events to retain only those attributes of interest to the top-N query (here 2 out of 48) – a frequently occurring situation in real-life streaming analytics. Partitioned streams gain most from the aggregation step, followed by the limit. The lower the output frequency of the aggregation step, the higher the reduction; in our experiments, the output frequency is 1 Hz. Finally, we observed a marginal effect of window size (w) on the reduction factor: the larger the time window, the more events are aggregated into it, the more slots are likely to be occupied (up to the maximum of 1/s) and therefore the larger the aggregated dataset reported at the chosen output frequency.

To justify our claim that off-the-shelf ESP tools are currently not optimized for bandwidth efficiency, and hence not suited for stream processing in distributed clouds, we ran similar experiments on the same dataset using Apache Spark Streaming [22]. Bandwidth consumption was 12.5 GB, or roughly 4x that of the central processing approach. The bulk of the inefficiency can be attributed to the shuffle phases, which are known (also in stock Hadoop) to be very wasteful of bandwidth. In future work, we will look at how Spark can be made bandwidth efficient, for example through the addition of customized job schedulers and the techniques explained in [15].

5.2.2 Query Execution Library Performance

As the amount of historic data stored during continuous query processing impacts performance the most, we measure and compare the raw throughput and memory footprint of both CHive's and Esper's aggregation window support (note 15). The benchmarks in this section were performed on an 8-core 2 GHz Intel Xeon 32-bit processor with 16 GB of memory, running CentOS 6.2, Java SE 1.7, Storm 0.8.2 and Esper 4.9.0.

Maximum Achievable Window Size – As a first experiment, we determine the maximum achievable window size for an event input rate of about 15K events/second, both for the standalone Esper implementation and for CHive on a single-worker Storm cluster (1 node, 1 JVM).

14. To speed up the experiments, we played the dataset at 3K events/s per MBS, i.e. much faster than real time.

15. Note, however, that the results should be taken with a grain of salt: Esper is much more mature than our prototype, hence supporting a significantly broader set of query features. Taking CHive to the same level of completeness as Esper may impact raw performance figures.
Fig. 9. Max percentage of outliers for explicit partitioning to be more efficient than unpartitioned processing. On the left, for various selectivity factors; on the right, for various values of N.
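The curves of Figure 9 follow directly from the formulas of Section 5.1; the Python sketch below (ours, again approximating the window occupancy by its ceiling 1/s) reproduces a few points:

def max_outlier_fraction(x, a=1000, e=20, n=10, y=1, s=1 / 1000):
    # Solve B_explicit < B_unpartitioned for the outlier fraction (1 - f).
    frac = (a + e) * y * (1 / s - n) / (2 * a * (2 - 1 / e) * x)
    return min(1.0, frac)

print(max_outlier_fraction(100))                # 1.0: always beneficial
print(max_outlier_fraction(1000))               # ~0.26: at most ~26% outliers
print(max_outlier_fraction(1000, s=1 / 10000))  # 1.0: larger key space helps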
Fig. 10. Bandwidth consumption (in MB) for top-N execution: [top] central processing (3258 MB), [bars 2-6] hierarchically distributed processing of unpartitioned streams for various window sizes w (526, 427, 418, 343 and 162 MB) and [bottom] partitioned streams (3.51 MB); each bar is labeled with its savings relative to central processing.
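The reduction percentages reported above follow directly from the per-bar totals in Figure 10; a quick check (ours):

central = 3258.0  # MB, central processing
unpartitioned = [526, 427, 418, 343, 162]  # MB, for the various window sizes
partitioned = 3.51  # MB

for mb in unpartitioned:
    print(f"unpartitioned savings: {1 - mb / central:.1%}")  # 83.9% .. 95.0%
print(f"partitioned savings: {1 - partitioned / central:.2%}")  # 99.89%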
For both setups, the JVM was configured with the -Xmx2G memory setting (i.e. the highest possible setting on a 32-bit JVM). Esper could handle windows of at most 50 seconds, while CHive was able to support a window of 720 seconds (12 minutes), which is 14.4 times better. Note that in order to support larger time windows, one must resort to parallelizing execution. CHive supports automatic parallelization as one of its optimization strategies, based on hints provided in the query. Since this involves splitting an event stream across multiple workers, the actual event rate per worker will be much lower (ideally divided by the number of workers), so that the maximum size of the time window can grow proportionally with the number of workers. To achieve the same behavior in Esper, one must manually configure the event routing to multiple Esper instances and make sure the partial per-instance results are merged afterwards, so as to produce a result that is semantically equivalent to the original query.

Maximum Achievable Throughput versus Window Size – Next, we compare the maximum sustainable event input rate (measured in events per second) as a function of the window size, for both Esper and CHive – again operating in a single JVM. The results are depicted in Figure 11. For the smallest time window (10 seconds), CHive's throughput equals 1.5 times that of Esper, while for the largest time window in this test (2 minutes), CHive outperforms Esper by a factor of 6.5.

Fig. 11. Maximum sustainable input rate vs. window size.
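Since splitting a stream across k workers ideally divides the per-worker event rate by k, the maximum supportable window grows roughly linearly with the number of workers. A back-of-the-envelope estimate (ours, assuming an even split and the single-JVM capacities measured above):

import math

def workers_needed(target_window_s, single_jvm_window_s):
    # Assumes the maximum window size scales linearly with the worker count.
    return math.ceil(target_window_s / single_jvm_window_s)

# Single-JVM capacity at ~15K events/s: CHive 720 s, Esper 50 s.
print(workers_needed(3600, 720))  # CHive: 5 workers for a 1-hour window
print(workers_needed(3600, 50))   # Esper: 72 manually wired instances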
6 CONCLUSION AND FUTURE WORK

This article illustrates the benefits of a new technology in the event stream processing toolbox, enabling bandwidth-efficient distributed stream processing in big data networks hosting densely distributed event sources – telecommunication networks and distributed cloud environments in particular. CHive seeks to offer a Hive-like solution to simplify and optimize streaming analytics in telecommunication clouds. As such, it offers a streaming query language enriching Esper's Event Processing Language (EPL) to facilitate distributed query execution. Starting from a CHiveQL query expression and a description of the deployment network, CHive's query compiler generates a query deployment plan that minimizes the expected overall bandwidth consumption. This is achieved by reducing event streams as early as possible, i.e. as close as possible to the event sources. Since this requires a large amount of parallelism in the query plans, the CHive query compiler includes a set of substitution rules for replacing query primitives with semantic equivalents that increase the resulting degree of inter-stream parallelism. In addition, we developed a scalable Storm-based execution environment, including a home-built repository of query primitive implementations, to execute a CHive query deployment plan. A mathematical evaluation and early experiments using real operator data indicate that CHive can yield bandwidth reductions upwards of 99%.
Future work targets multiple optimizations of both the CHive query compiler and the execution engine. We plan to build monitoring components that tap into various places in the query execution workflow to gather measurements on data distribution, current event rates, output-to-input ratios, execution latency and throughput, in order to continuously optimize query plans and query execution performance. Note that these measurements will make user-provided hints obsolete. Another optimization is to reuse components (processing components, aggregation windows) across multiple similar continuous queries, in order to significantly reduce the overall memory footprint. We also plan to ensure end-to-end fault tolerance for distributed query plans deployed across multiple Storm clusters. Finally, the ultimate big data analytics tool offers a unified platform facilitating analytics on both historical data and live streams, also referred to as the Lambda Architecture [29]. CHiveQL was purposely designed to have features of both Hive and Esper, so that data analysts familiar with either technology have a familiar language to define queries spanning the batch-oriented and real-time processing worlds. At the execution layer, we plan to build the necessary components bridging both worlds to form an all-encompassing analytics tool.
REFERENCES

[1] Apache Hadoop. [Online]. Available: http://hadoop.apache.org/
[2] Spark - lightning-fast cluster computing. [Online]. Available: http://spark.apache.org
[3] Storm - distributed and fault-tolerant realtime computation. [Online]. Available: http://storm.incubator.apache.org
[4] D. Peng and F. Dabek, "Large-scale incremental processing using distributed transactions and notifications," in Proceedings of the 9th USENIX Conference on Operating Systems Design and Implementation. USENIX Association, 2010, pp. 1-15.
[5] F. Frattini, K. S. Trivedi, F. Longo, S. Russo, and R. Ghosh, "Scalable analytics for IaaS cloud availability," IEEE Transactions on Cloud Computing, vol. 2, no. 1, pp. 57-70, 2014.
[6] A. Thusoo, J. S. Sarma, N. Jain, Z. Shao, P. Chakka, N. Zhang, S. Anthony, H. Liu, and R. Murthy, "Hive - a petabyte scale data warehouse using Hadoop," in Proceedings of the 26th International Conference on Data Engineering (ICDE 2010), Long Beach, California, USA. IEEE, 2010, pp. 996-1005.
[7] S. Babu and J. Widom, "Continuous queries over data streams," SIGMOD Rec., vol. 30, no. 3, pp. 109-120, Sep. 2001.
[8] I. Satoh, "MapReduce processing on IoT clouds," in 2013 IEEE 5th International Conference on Cloud Computing Technology and Science, vol. 1, 2013, pp. 323-330.
[9] L. Mai, L. Rupprecht, P. Costa, M. Migliavacca, P. Pietzuch, and A. L. Wolf, "Supporting application-specific in-network processing in data centres," in Proceedings of the ACM SIGCOMM 2013 Conference, ser. SIGCOMM '13. New York, NY, USA: ACM, 2013, pp. 519-520.
[10] J. Dean and S. Ghemawat, "MapReduce: Simplified data processing on large clusters," Commun. ACM, vol. 51, no. 1, pp. 107-113, Jan. 2008.
[11] M. Hayes and S. Shah, "Hourglass: A library for incremental processing on Hadoop," in Proceedings of the 2013 IEEE International Conference on Big Data, Santa Clara, CA, USA, 2013, pp. 742-752.
[12] A. Thusoo, J. S. Sarma, N. Jain, Z. Shao, P. Chakka, S. Anthony, H. Liu, P. Wyckoff, and R. Murthy, "Hive: A warehousing solution over a map-reduce framework," Proc. VLDB Endow., vol. 2, no. 2, pp. 1626-1629, Aug. 2009.
[13] Apache Hive. [Online]. Available: http://hive.apache.org
[14] S. Wu, F. Li, S. Mehrotra, and B. C. Ooi, "Query optimization for massively parallel data processing," in Proceedings of the 2nd ACM Symposium on Cloud Computing. ACM, 2011, pp. 12:1-12:13.
[15] C. Jayalath, J. Stephen, and P. Eugster, "From the cloud to the atmosphere: Running MapReduce across data centers," IEEE Transactions on Computers, vol. 63, no. 1, pp. 74-87, Jan. 2014.
[16] D. J. Abadi, D. Carney, U. Çetintemel, M. Cherniack, C. Convey, S. Lee, M. Stonebraker, N. Tatbul, and S. Zdonik, "Aurora: A new model and architecture for data stream management," The VLDB Journal, vol. 12, no. 2, pp. 120-139, Aug. 2003.
[17] D. J. Abadi, Y. Ahmad, M. Balazinska, M. Cherniack, J.-H. Hwang, W. Lindner, A. S. Maskey, E. Rasin, E. Ryvkina, N. Tatbul, Y. Xing, and S. Zdonik, "The design of the Borealis stream processing engine," in Proceedings of CIDR, 2005, pp. 277-289.
[18] N. Marz. Trident tutorial. [Online]. Available: https://github.com/nathanmarz/storm/wiki/Trident-tutorial
[19] P. Nathan, Enterprise Data Workflows with Cascading, 1st ed. O'Reilly Media, Inc., 2013.
[20] Cascading. [Online]. Available: http://www.cascading.org
[21] T. Condie, N. Conway, P. Alvaro, J. M. Hellerstein, K. Elmeleegy, and R. Sears, "MapReduce online," in Proceedings of the 7th USENIX Conference on Networked Systems Design and Implementation, ser. NSDI '10. Berkeley, CA, USA: USENIX Association, 2010.
[22] M. Zaharia, T. Das, H. Li, S. Shenker, and I. Stoica, "Discretized streams: An efficient and fault-tolerant model for stream processing on large clusters," in Proceedings of the 4th USENIX Conference on Hot Topics in Cloud Computing, ser. HotCloud '12. Berkeley, CA, USA: USENIX Association, 2012.
[23] Complex event processing with Esper. [Online]. Available: http://esper.codehaus.org/
[24] J. Medved, "ALTO network-server and server-server APIs," IETF, Draft, 2011.
[25] J. Seedorf, S. Kiesel, and M. Stiemerling, "Traffic localization for P2P applications: The ALTO approach," in Peer-to-Peer Computing, H. Schulzrinne, K. Aberer, and A. Datta, Eds. IEEE, 2009, pp. 171-177.
[26] Akka framework. [Online]. Available: http://akka.io/
[27] D. DeWitt and J. Gray, "Parallel database systems: The future of high performance database systems," Commun. ACM, vol. 35, no. 6, pp. 85-98, Jun. 1992.
[28] Disruptor: High performance alternative to bounded queues for exchanging data between concurrent threads. [Online]. Available: http://lmax-exchange.github.io/disruptor/files/Disruptor1.0.pdf
[29] N. Marz and J. Warren, Big Data: Principles and Best Practices of Scalable Realtime Data Systems. Manning Publications, 2013.
Bart Theeten is a Member of Technical Staff at Bell Labs in Antwerp, Belgium. He holds an M.Eng. in Computer Science from the University of Ghent, Belgium. He began his career as a software engineer and later technical team lead working on various network and element management solutions, before joining the Scalable Data Processing department at Bell Labs, where he focuses his research on building massively scalable, real-time distributed Big Data management systems.

Nico Janssens is a Member of Technical Staff at Bell Labs in Antwerp, Belgium. He holds an M.Sc. in Informatics and a Ph.D. in Computer Science, both from the University of Leuven, Belgium. His current research interests include various topics related to the cloudification of telecommunication software, including dynamic right-sizing (elasticity), massive scalability and real-time cloud analytics.