The Eleventh International Conference on Digital Information Management (ICDIM 2016)

A “Fast Data” Architecture: Dashboard for Anomalous Traffic Analysis in Data Networks

Miguel Angel López Peña, Carlos Area Rua, Sergio Segovia Lozoya
Research and Development Department
Sistemas Avanzados de Tecnología, S.A.
Madrid, Spain
{miguelangel.lopez, carlos.area, sergio.segovia}@satec.es

Abstract—Fast Data is a new Big Data computing paradigm that ensures requirements such as Real-Time processing of continuous data streams and storage at high rates and low latency with no data losses. In this work we propose a "Fast Data" architecture for a specific kind of software application in which input data arrive very fast and the result for each processed data item has to be produced at a matching rate. We applied this architecture to build a Dashboard for Anomalous Traffic Analysis in Data Networks. In order to fulfil the requirements of Real-Time processing and no data losses, we carry out a design that consists of a pattern of a dynamic tree of process pipelines, where the number of branches increases proportionally to the input data rate. Two different approaches have been followed to implement this design pattern: one based on a well-known set of products from the Big Data ecosystem, and the other built with Kafka, Zookeeper and a set of components designed and implemented by us. These two implementations have been compared in terms of velocity and scalability. As a result, the implementation built with our own components is significantly faster and more scalable than the traditional one. The good results obtained by using both the design pattern of a dynamic tree of process pipelines and our implementation make them very suitable for use in other scenarios and applications such as smart cities, environment monitoring, Industry 4.0, distributed control systems, etc.

Keywords— Fast Data, Big Data, Data-Driven, Continuous Data Processing, Stream Processing, Scalability.

I. INTRODUCTION

“Fast Data” defines those systems that have features such as continuous input data streams, Real-Time processing, data-driven operation, no data losses, and data and result storage at high rates [1]. We are therefore not facing a typical “Big Data” scenario, defined by characteristics like variety, velocity and volume in batch processing [2], but rather one defined by conditions such as continuous, fast and reliable data processing and flexible storage and query operations, which turn out to be key requirements in our structural design [3]. In this sense the concept of "Fast Data" [1] is more appropriate than that of "Big Data".

One of the prototypes defined in the ONTIC Project [4] is an analytic dashboard for anomaly detection in data networks. This prototype is a software application that needs to ingest network traffic data and intrusion/attack/anomaly events continuously at high rates.

This paper proposes an architecture design aligned with the emergent concept of “Fast Data” that provides high performance and scalability in continuous data processing. Two different design approaches are developed and tested, and both models are implemented as the core of the ONTIC Dashboard prototype in order to compare their performance.

II. ONTIC DASHBOARD DESCRIPTION

The proposed dashboard application is an ISP/CSP network administrator tool that provides full network supervision by performing online traffic monitoring plus anomaly detection. In order to achieve this goal, the dashboard application is composed of two main modules: the network data processing module and the result visualization module. The network data processing module accepts two incoming Real-Time data sources: the data traffic from the network in Netflow V5 format [5], and the traffic anomalies from an external anomaly detection application (for this work, that application has been implemented ad hoc as a set of rules similar to firewall policies). Each input data stream is processed by its own flow consisting of four functional steps: data ingestion, parsing, processing and storage. Concurrently, the result visualization module shows, in near Real-Time and on a dynamic user interface, the analytics about the data traffic and data anomalies: overall traffic statistics regarding IPs, ports, type of service, #packets, #bytes, etc.; statistics related to traffic flows, such as conversations; the status in any time interval and detailed information about it; the list of anomalies detected externally and complete statistics for each one; etc. (Fig. 1).

Anomaly detection in a real scenario (i.e. monitoring a real network link in an ISP data center) needs to fulfil a set of non-functional requirements related to performance and scalability [6], such as:

• Data processing from different sources with different formats.
• Processing high volumes of input data without losses.
• Parallel processing of all input data streams.
• Distributed, scalable and highly available architecture.
• Near Real-Time behaviour both in the ingestion/processing/storage process and in the query/visualization process.


Fig. 1. Dashboard user views.

All these requirements force us to design the system over a data-driven support architecture in which the following prevail: velocity in terms of input data rate; reliability (no data losses); continuous processing (instead of batch processing); operations on each data item that are not very complex but are definitely fast; and agility in storing and accessing data.

This kind of architecture, known as a "Fast Data" architecture [7], [8], [9], whose generic structure is shown in Fig. 2, has been designed and implemented in this work as part of the dashboard system, through a process of building-block integration (some blocks developed ad hoc by SATEC and others produced by third parties).

Fig. 2. Generic Fast Data architecture.

III. ONTIC DASHBOARD SOLUTION ARCHITECTURE

Bearing in mind our application’s requirements, we need a distributed architecture with the capability to handle a growing amount of work, which means it should be able to increase its total output under an increased load when resources (typically hardware) are added, and to do so at the same speed as data arrive (near Real-Time processing). Additionally, our system is designed to cope with loads dynamically by expanding and contracting resource consumption across a cluster of computers, in a matter of minutes or even seconds. In short, we need a distributed architecture that is both scalable and elastic [10], ready to deal with vast amounts of fast, continuous data.

Accordingly, the architectural model is composed of the following components:

• Data Collector: the entry point to the architecture. It is responsible for receiving streams of raw data from different sources and shipping them to the message queuing system while avoiding data losses.

• Message queuing system: here data are partitioned and spread over the cluster, to allow streams larger than the capability of any single machine and to allow data to be fetched by consumers. This component ensures that each input data item is processed exactly once and that all data are processed.

• Data Filter/Processors: a group of coordinated data consumers that fetch the queued data, perform some processing (parsing, aggregation, fast analysis, etc.) and finally store the resulting flow in a Database.

• Elastic NoSQL Database: a non-relational database that is distributed, schema-free, consistent, able to work with huge amounts of data and horizontally scalable.

• Distributed multitenant-capable full-text Search engine/Database: the final architectural component, through which data are retrieved using powerful NoSQL queries in a continuous-query model [11], [12]. This component searches sets of data across the whole Database, retrieves them and delivers them ready to be analysed (see the query sketch below).

This model draws an architectural pattern similar to a pipeline that spreads out (and shrinks) as a tree, in such a way that when a component of the architecture expands, all following components must scale at least in the same order to handle the heavier load; but when the load becomes less demanding, these same components must scale back down to fit the new conditions and thus avoid wasting resources (Fig. 3).
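As an illustration of the continuous-query model used by the last component listed above, the following is a minimal sketch (not the project's code) of how a client such as the visualizer might repeatedly poll Elasticsearch's REST _search API for records newer than the previous poll. The index name, the "timestamp" field, the endpoint and the polling interval are illustrative assumptions.

```java
import java.io.BufferedReader;
import java.io.InputStreamReader;
import java.io.OutputStream;
import java.net.HttpURLConnection;
import java.net.URL;
import java.nio.charset.StandardCharsets;

/** Minimal continuous-query loop against Elasticsearch's REST _search API (illustrative only). */
public class ContinuousQueryClient {
    public static void main(String[] args) throws Exception {
        String searchUrl = "http://localhost:9200/netflow/_search";  // assumed endpoint and index
        long lastSeen = System.currentTimeMillis();

        while (true) {
            // Range query on an assumed "timestamp" field: fetch only records newer than the last poll.
            String query = "{ \"size\": 500, \"sort\": [{\"timestamp\": \"asc\"}], "
                    + "\"query\": { \"range\": { \"timestamp\": { \"gt\": " + lastSeen + " } } } }";

            HttpURLConnection con = (HttpURLConnection) new URL(searchUrl).openConnection();
            con.setRequestMethod("POST");
            con.setRequestProperty("Content-Type", "application/json");
            con.setDoOutput(true);
            try (OutputStream out = con.getOutputStream()) {
                out.write(query.getBytes(StandardCharsets.UTF_8));
            }

            // Read the raw JSON response; a real client would parse the hits and feed the dashboard views.
            try (BufferedReader in = new BufferedReader(
                    new InputStreamReader(con.getInputStream(), StandardCharsets.UTF_8))) {
                StringBuilder body = new StringBuilder();
                String line;
                while ((line = in.readLine()) != null) body.append(line);
                System.out.println(body);
            }

            lastSeen = System.currentTimeMillis();
            Thread.sleep(1000);  // poll every second to approximate a continuous query
        }
    }
}
```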


Fig. 3. Dashboard “Fast Data” Architecture.
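The dynamic tree of process pipelines sketched in Fig. 3 maps naturally onto Kafka's partitioning model: a topic is created with as many partitions as the maximum number of branches of a pipeline stage, and every replica of that stage joins the same consumer group, so Kafka spreads the partitions among the replicas. The sketch below is illustrative only; the topic name, partition count and broker address are assumptions, and it uses the Kafka Java AdminClient available in recent client versions rather than whatever tooling the 2016 deployment used.

```java
import java.util.Collections;
import java.util.Properties;
import org.apache.kafka.clients.admin.AdminClient;
import org.apache.kafka.clients.admin.NewTopic;

/** Provision a partitioned topic so a pipeline stage can be replicated (illustrative sketch). */
public class PipelineTopicSetup {
    public static void main(String[] args) throws Exception {
        Properties props = new Properties();
        props.put("bootstrap.servers", "localhost:9092");   // assumed broker address

        try (AdminClient admin = AdminClient.create(props)) {
            // One partition per potential pipeline branch: up to 6 parallel filter/processor
            // instances can later join the same consumer group and split these partitions.
            NewTopic netflowTopic = new NewTopic("netflow-records", 6, (short) 1);
            admin.createTopics(Collections.singletonList(netflowTopic)).all().get();
        }
        // Each filter/processor replica subscribes with the same "group.id"; Kafka rebalances the
        // partitions across however many replicas are running, which is what lets a branch of the
        // pipeline tree grow or shrink on demand without any change in the producers.
    }
}
```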

IV. FAST DATA ARCHITECTURE IMPLEMENTATION

A. Big Data implementation approach

The first implementation approach was to build the architecture using components selected from the Big Data ecosystem. We started off by relying on a well-known open-source solution called ELK [13], because it fitted our architectural design almost entirely. ELK is a software stack that stands for Elasticsearch, Logstash and Kibana and, in its own words, is a “project that helps you take data from any source, any format and search, analyse, and visualize it in Real-Time”. More precisely, Logstash matched both our “data collector” and “data filter/processor” functionality, and Elasticsearch our search-engine storage demands. Since ELK does not include any queuing system as a component, we chose Apache Kafka [14] (“publish-subscribe messaging rethought as a distributed commit log, fast and scalable”) because of its healthy, active community and widespread use.

The implementation and deployment of this approach was easy and quick; nevertheless, the first tests performed showed that Logstash was significantly inefficient, since it did not let us match high input rates, in addition to not showing elastic behaviour. Fig. 4 shows a load test and demonstrates how the Kafka queues have to absorb a growing backlog because the processing speed is lower than the input rate.

B. SATEC implementation approach

Given the results obtained with the first approach, we decided to move to a second approach in which we carefully designed and coded our own lighter Java solutions for both the Data Collector and the Data Filter/Processor.

The first component that was redesigned and coded was the Data Collector. Our Data Collector is only responsible for parsing incoming byte arrays before forwarding them to Kafka. This component is coded for each data source with very few lines of code. It works in memory and behaves notably faster than Logstash (see the collector sketch below).

The Data Filter/Processor is the more complex element in the pipeline because it has to take on tasks such as calculations, data aggregation, filtering, etc., and eventually store the data in the Database. This component has therefore been built as light as possible, avoiding unnecessary work, so it also shows slightly faster performance than Logstash and, what is more, exhibits the desired elastic behaviour thanks to the use of Zookeeper coordination [15]. As in the case of the previous component (the Collector), one of these items has to be developed for each data source, and within the architecture it is replicated at the same rate as the queues, ensuring the scalability and consistency of the entire pipeline. Therefore, this component, together with the queuing system, is the key that supports the design of the dynamic tree of process pipelines.

The Message Queuing System component (Kafka) has not been modified, but it is important to note its role in the architecture, because it is what allows each pipeline to be replicated as many times as necessary to process the data flow at the rate it arrives, without losses.

Although the Real-Time Visualizer component should be considered part of the application rather than of the architecture, it is worth highlighting that in our solution it has been developed to replace Kibana. The Real-Time Visualizer is based on a set of graphical libraries (D3, C3, Leaflet, etc.) and includes fast querying and data retrieval from the Database as well as a powerful interactive graphical data visualization system.

Once the new components were coded and the whole architecture was integrated, we proceeded to check its performance with the same load test used in the first approach. The results obtained (Fig. 5) demonstrate that this design solves the limitations of the first one in three aspects: the total execution time is reduced by around 87% (from 35 minutes to 4.5 minutes), the processing time for each data item matches the input rate and, as a result, Kafka does not have to hold back a backlog.
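To make the collector's role concrete, the sketch below reads Netflow V5 datagrams from a UDP port, separates the 24-byte header from the 48-byte flow records and forwards each record to Kafka. It is a simplified, illustrative sketch rather than the project's code: the UDP port, the topic name and the choice of forwarding the raw record bytes unchanged are assumptions.

```java
import java.net.DatagramPacket;
import java.net.DatagramSocket;
import java.nio.ByteBuffer;
import java.util.Properties;
import org.apache.kafka.clients.producer.KafkaProducer;
import org.apache.kafka.clients.producer.ProducerRecord;

/** Minimal Netflow V5 collector: UDP in, one Kafka message per flow record (illustrative sketch). */
public class NetflowCollector {
    private static final int HEADER_BYTES = 24;   // Netflow V5 header size
    private static final int RECORD_BYTES = 48;   // Netflow V5 flow record size

    public static void main(String[] args) throws Exception {
        Properties props = new Properties();
        props.put("bootstrap.servers", "localhost:9092");                   // assumed broker
        props.put("key.serializer", "org.apache.kafka.common.serialization.StringSerializer");
        props.put("value.serializer", "org.apache.kafka.common.serialization.ByteArraySerializer");

        try (DatagramSocket socket = new DatagramSocket(2055);              // assumed Netflow export port
             KafkaProducer<String, byte[]> producer = new KafkaProducer<>(props)) {

            byte[] buf = new byte[HEADER_BYTES + 30 * RECORD_BYTES];        // V5 exports at most 30 records
            while (true) {
                DatagramPacket packet = new DatagramPacket(buf, buf.length);
                socket.receive(packet);
                ByteBuffer data = ByteBuffer.wrap(packet.getData(), 0, packet.getLength());

                int version = data.getShort() & 0xFFFF;                     // header: version, record count
                int count = data.getShort() & 0xFFFF;
                if (version != 5) continue;                                 // ignore non-V5 datagrams

                // Forward each 48-byte flow record as-is; field parsing is left to the filter/processor.
                for (int i = 0; i < count; i++) {
                    byte[] record = new byte[RECORD_BYTES];
                    data.position(HEADER_BYTES + i * RECORD_BYTES);
                    data.get(record);
                    producer.send(new ProducerRecord<>("netflow-records", record));
                }
            }
        }
    }
}
```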


Fig. 4. Load Test of the ELK-based architecture implementation.


Fig. 5. Load Test of SATEC’s architecture implementation.

Comparing the results of these load tests (Fig. 5) with the results obtained with the previous design (Fig. 4), we can observe the improvements of SATEC's design and conclude that only our implementation can satisfy the required throughput.

V. SATEC ARCHITECTURE PERFORMANCE

In order to validate our architecture’s brand new components —“data collector” and “data filter/processor”— and the final design, a test was prepared in which both implementations (ELK-based and ours) consume 2.3 million Netflow records (each around 200 bytes long) from a UDP port at different simulated speeds, in order to compare the limits of both alternatives. The queuing system and the Database are provided with enough resources to make sure they do not become a bottleneck, so that the only difference lies in the components we have replaced from the ELK architecture.

The Data Collector functionality in both implementations is the following:

• Reading Netflow V5 data (both headers and records) from a UDP port.
• Separating header records and flow records.
• Creating a data record with the relevant information.
• Sending the data record to Kafka.

The “Data filtering and processing” functionality in both implementations is the following (a simplified sketch follows this list):

• Reading the flow data record from Kafka.
• Parsing the record and converting the data types to standard ones (long, date, integer, etc.).
• Calculating new aggregated fields: flows and conversations.
• Calculating the timestamp and converting it to a standard date type.
• Removing the unnecessary labels.
• Storing the enriched record into the Database.
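A highly simplified sketch of these steps is shown below: a Kafka consumer in a shared group reads raw flow records, converts a few Netflow V5 fields to standard types and writes the enriched document to Elasticsearch over HTTP. It is illustrative only; the group id, index name, the subset of parsed fields, the poll(Duration) API of recent Kafka clients and the use of plain HTTP indexing against the /_doc endpoint of recent Elasticsearch versions are all assumptions, not the project's actual code.

```java
import java.io.OutputStream;
import java.net.HttpURLConnection;
import java.net.URL;
import java.nio.ByteBuffer;
import java.nio.charset.StandardCharsets;
import java.time.Duration;
import java.util.Collections;
import java.util.Properties;
import org.apache.kafka.clients.consumer.ConsumerRecord;
import org.apache.kafka.clients.consumer.ConsumerRecords;
import org.apache.kafka.clients.consumer.KafkaConsumer;

/** Minimal filter/processor: consume flow records, enrich, index into Elasticsearch (illustrative sketch). */
public class FlowProcessor {
    public static void main(String[] args) throws Exception {
        Properties props = new Properties();
        props.put("bootstrap.servers", "localhost:9092");                    // assumed broker
        props.put("group.id", "flow-processors");                            // shared group: partitions are split
        props.put("key.deserializer", "org.apache.kafka.common.serialization.StringDeserializer");
        props.put("value.deserializer", "org.apache.kafka.common.serialization.ByteArrayDeserializer");

        try (KafkaConsumer<String, byte[]> consumer = new KafkaConsumer<>(props)) {
            consumer.subscribe(Collections.singletonList("netflow-records"));
            while (true) {
                ConsumerRecords<String, byte[]> records = consumer.poll(Duration.ofMillis(500));
                for (ConsumerRecord<String, byte[]> record : records) {
                    ByteBuffer b = ByteBuffer.wrap(record.value());          // one Netflow V5 flow record
                    long srcAddr = b.getInt(0) & 0xFFFFFFFFL;                // convert to standard types
                    long dstAddr = b.getInt(4) & 0xFFFFFFFFL;
                    long packets = b.getInt(16) & 0xFFFFFFFFL;
                    long bytes   = b.getInt(20) & 0xFFFFFFFFL;
                    int srcPort  = b.getShort(32) & 0xFFFF;
                    int dstPort  = b.getShort(34) & 0xFFFF;

                    // Enriched document; real code would also compute flow/conversation aggregates here.
                    String doc = String.format(
                        "{\"srcAddr\":%d,\"dstAddr\":%d,\"srcPort\":%d,\"dstPort\":%d,"
                        + "\"packets\":%d,\"bytes\":%d,\"timestamp\":%d}",
                        srcAddr, dstAddr, srcPort, dstPort, packets, bytes, System.currentTimeMillis());
                    index(doc);
                }
            }
        }
    }

    /** Store one document via Elasticsearch's HTTP index API (endpoint and index name are assumptions). */
    private static void index(String json) throws Exception {
        HttpURLConnection con = (HttpURLConnection) new URL("http://localhost:9200/netflow/_doc").openConnection();
        con.setRequestMethod("POST");
        con.setRequestProperty("Content-Type", "application/json");
        con.setDoOutput(true);
        try (OutputStream out = con.getOutputStream()) {
            out.write(json.getBytes(StandardCharsets.UTF_8));
        }
        con.getResponseCode();   // drain the response; error handling and bulk indexing omitted
    }
}
```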

The hardware environment used for the tests is a single commodity server with the following features:

• Processor: 2 x Intel Xeon E5-2620 v2 @ 2.10 GHz, 6 cores / 12 threads each.
• Memory: 256 GB RAM DDR3.
• Disks: 4 x WD RE 2 TB 7200 RPM SATA, plus 1 x SSD 128 GB (only for the host OS).
• OS: Ubuntu 14.04.2 LTS in all machines.

In addition, a virtual environment, connected by an internal network to emulate several machines, was deployed. The specifications for this virtualization are:

• SATEC’s data collector (6 vCPU, 32 GB).
• Kafka (2 vCPU, 32 GB).
• SATEC’s data filter/processor (2 vCPU, 16 GB).
• Database: 1 x Elasticsearch master (2 vCPU, 32 GB), 2 x Elasticsearch data nodes (5 vCPU, 16 GB).

The results demonstrate that the ELK-based architecture is able to process 2.6 KRecords/s, whereas our architecture implementation is able to process up to 12.5 KRecords/s, which means a performance 4.8 times better, achieving the desired near Real-Time throughput (Fig. 6).

Fig. 6. Processing speed comparison.

Once we had achieved an operative and efficient implementation in terms of processing velocity, we proceeded to evaluate its load scalability with a set of tests without limiting the reading speed. In order to perform the scalability tests, two clusters were installed and configured on the Google Cloud platform:

• Cluster 1, with 15 virtual nodes (n1-standard-1, vCPU: 1, RAM: 3.75 GB each) to run the pipelines in parallel.

• Cluster 2, with 6 virtual nodes to run the Database: 1 master node (n1-highmem-4, vCPU: 4, RAM: 26 GB) and 5 data nodes (n1-standard-4, vCPU: 4, RAM: 15 GB).

The core of the architecture (except the Database and the message queuing system) was deployed over 14 nodes to run progressively from 1 up to 14 pipelines in parallel. A single instance of the message queuing system was deployed on the 15th node. The Database was installed on the 6-node cluster in a non-scalable deployment, in order to isolate the last step of the pipeline and study the hardware dependency of the scalability (this configuration provided a DB throughput of around 55,000 records/s).

Taking into account the capacity limit imposed on the Database, the speedup results obtained show a promising linear scalability according to Amdahl's Law (1) [16] and Gustafson’s Law (2) [17], and a high degree of efficiency of the parallel system (3) [18], for up to 6 process pipelines.

S = 1 / ((1 - α) + α / P)        (1)

S(P) = P - α · (P - 1)           (2)

E = S / P                        (3)

where S is the speedup, α is the degree of code parallelism, P is the number of process instances or process pipelines, and E is the efficiency of the parallel system.
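As a worked illustration of how (1) and (3) relate to the measurements reported below, the following small snippet derives the degree of code parallelism implied by a measured speedup (by inverting Amdahl's Law) and the corresponding efficiency; the input values are the ones reported for the 6-pipeline case.

```java
/** Worked example for equations (1) and (3) using the measured point S = 5.97 at P = 6. */
public class SpeedupExample {
    public static void main(String[] args) {
        double S = 5.97;                       // measured speedup for 6 pipelines
        int P = 6;                             // number of process pipelines

        // Invert Amdahl's Law (1): S = 1 / ((1 - a) + a / P)  =>  a = (1 - 1/S) * P / (P - 1)
        double alpha = (1.0 - 1.0 / S) * P / (P - 1.0);

        // Efficiency (3): E = S / P
        double efficiency = S / P;

        System.out.printf("alpha = %.4f (degree of code parallelism)%n", alpha);
        System.out.printf("E     = %.4f (efficiency of the parallel system)%n", efficiency);
        // Prints roughly alpha = 0.9990 and E = 0.9950, consistent with a parallelism close to
        // 100% and an efficiency above the 93.7% reported in the text.
    }
}
```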

The tests performed, for up to 6 process pipelines, show that the speedup (S) expressed in (1) reaches a value of 5.97 for 6 process instances or process pipelines (P), with a degree of code parallelism (α) close to 100% and an efficiency of the parallel system (E), defined in (3), higher than 93.7%. Beyond 6 pipelines the system cannot scale any further because the output rate of the filtering/processing module reaches the Database throughput (intentionally set to around 55 Krec/s): the speedup (S) stops growing, and the degree of parallelism (α) and efficiency (E) values start to decrease gradually. Fig. 7 illustrates the test results and shows a very high speedup of the architecture in terms of scalability and efficiency.

Fig. 7. Scalability of SATEC’s architecture.

VI. CONCLUSIONS AND FUTURE WORK

As a result of this work, a simple and powerful "Fast Data" architecture has been defined and tested, with good performance and the ability to scale horizontally when the input data velocity increases. The base concept of creating a dynamic tree of process pipelines proves to be very effective in terms of scalability.

Taking into account the encouraging results regarding scalability and performance obtained with the described architecture, we plan to continue improving it with the following features:

• Automatic horizontal scalability: incorporating a performance monitor that triggers a system process to deploy new component instances (pipeline tree model) on the available hardware (virtual nodes or real servers) when necessary.

• Better processing capabilities: including a new component in the pipeline, based on a fast rule engine, to enable data processing under an event-driven model.

• Capacity to accept new data sources without stopping the current internal processes.

In addition, we are working on the definition and design of a variety of use cases in which the architecture can be useful: smart cities, smart grids, environment monitoring, Industry 4.0 or data networks, for instance.

ACKNOWLEDGMENT

This work has been developed within the FP7 ONTIC project, which has received funding from the European Union's Seventh Framework Programme (FP7/2007-2011) under grant agreement no. 619633.

Thanks to Isabel Muñoz (Associate Professor at the Technical University of Madrid) for her comments and support.

REFERENCES

[1] Wampler, D. Fast Data: Big Data Evolved. Lightbend white paper. https://info.typesafe.com/rs/558-NCX-702/images/COLL-white-paper-fast-data-big-data-evolved.pdf
[2] Chen, C. P., & Zhang, C. Y. (2014). Data-intensive applications, challenges, techniques and technologies: A survey on Big Data. Information Sciences, 275, 314-347.
[3] Aggarwal, C. (2007). Data Streams: Models and Algorithms. Kluwer Academic Publishers, Dordrecht, The Netherlands.
[4] Online Network Traffic Characterization (ONTIC) Project Portal. http://ict-ontic.eu
[5] Cisco. NetFlow export format (Version 5). http://www.cisco.com/c/en/us/td/docs/net_mgmt/netflow_collection_engine/3-6/user/guide/format.html#wp1003394
[6] Jacobs, A. (2009). The pathologies of big data. Communications of the ACM, 52(8), 36-44.
[7] Jarr, S. (2014). Fast Data and the New Enterprise Data Architecture. O'Reilly (November 2014).
[8] Mishne, G., Dalton, J., Li, Z., Sharma, A., & Lin, J. (2013). Fast data in the era of big data: Twitter's real-time related query suggestion architecture. In Proceedings of the 2013 ACM SIGMOD International Conference on Management of Data (pp. 1147-1158). ACM.
[9] Lam, W., Liu, L., Prasad, S. T. S., Rajaraman, A., Vacheri, Z., & Doan, A. (2012). Muppet: MapReduce-style processing of fast data. Proceedings of the VLDB Endowment, 5(12), 1814-1825.
[10] Cooper, B. F., Silberstein, A., Tam, E., Ramakrishnan, R., & Sears, R. (2010). Benchmarking cloud serving systems with YCSB. In Proceedings of the 1st ACM Symposium on Cloud Computing (pp. 143-154). ACM.
[11] Babu, S., & Widom, J. (2001). Continuous queries over data streams. SIGMOD Record.
[12] Balakrishnan, H., Balazinska, M., Carney, D., Cetintemel, U., Cherniack, M., Convey, C., Galvez, E., Salz, J., Stonebraker, M., Tatbul, N., Tibbetts, R., & Zdonik, S. (2004). Retrospective on Aurora. VLDB Journal, Special Issue on Data Stream Processing.
[13] ELK stack. https://www.elastic.co/products
[14] Apache Kafka. http://kafka.apache.org/
[15] Apache Zookeeper. https://zookeeper.apache.org/
[16] Amdahl, G. M. (1967). Validity of the single-processor approach to achieving large-scale computing capability. In Proceedings of the AFIPS Conference (pp. 483-485). Reston, VA.
[17] Gustafson, J. L. (1988). Reevaluating Amdahl's Law. Communications of the ACM, 31(5), 532-533.
[18] Gupta, A., & Kumar, V. (1993). Isoefficiency function: a scalability metric for parallel algorithms and architectures. IEEE Transactions on Parallel and Distributed Systems, 4(8), 922-932.
