Application of Big Data, Fast Data and Data Lake Concepts to Information Security

Natalia Miloslavskaya and Alexander Tolstoy
National Research Nuclear University MEPhI (Moscow Engineering Physics Institute), “Information Security of Banking Systems” Department

BigR&I 2016, Vienna, 23 August 2016

CONTENT

Introduction
1. Big Data Concept
2. Data Lake Concept
3. Fast Data Concept
Conclusion

Introduction (1/4)

Big IT Infrastructure (ITI) Security-Related Data:
• coming from separate domain controllers, proxy servers, DNS servers, information protection tools (IPT)
• describing the current configuration of network devices, generalized characteristics of network traffic and telecommunications, the functioning of application and network services, and the activity and specific actions of individual end users
• containing e-mails, phone records, web-based content, website click streams, metadata, digitized audio and video, video surveillance, GPS locations
• data of business processes, the enterprise’s internal documents and analytical data
• time-series data spanning many years of an enterprise’s existence

=> Volumes and heterogeneity of data on information security (IS) events, ITI assets, their vulnerabilities, users, IS threats…

Introduction (2/4)

Among many other big data application areas, there are two related to IS, where the amalgamation of big data and real-world insight work together:

1) Providing big data IT as services (ready functional modules) in the implementation of other IT, in particular search technology, deep data analytics to identify hidden patterns, search of primary information sources and retrieval of the main content (semantics) from extra-large arrays of documents without their direct reading by a human, etc.

2) Analytical processing of data about the ITI’s state to identify anomalies in system functioning, IS incidents and intrusion prevention, etc.

Introduction (3/4)

Big ITI Security-Related Data should be

1) processed correctly and promptly:
• to identify, structure, consolidate and visualize IS threats, vulnerabilities to be eliminated and IS incidents that have occurred
• to optimize the ITI monitoring strategy and resources
• to calculate current and forecast further IS risks
=> to make timely and informed decisions on ITI IS …

2) evaluated from the viewpoint of any attack:
• to find its source, consider its type, weigh its consequences, visualize its vector, associate all target systems, prioritize countermeasures and offer mitigation solutions with weighted impact relevance.

It is a must to maintain the recorded relationships of every file execution/modification, registry modification, network connection, executed binary in your environment, etc. Moreover, it is a data stream with the following unique features: huge or possibly infinite volume, dynamically changing, flowing in and out in a fixed order, demanding a fast (often real-time) response, etc.

Introduction (4/4)

Standard terminology in the field of big data has not yet been developed. First of all we had big data. Now we witness the appearance of another two concepts: data lakes and fast data.

Are they simply new marketing labels for the old big data IT, or really new concepts?

=> Our goal: to identify the relationship between these three concepts.

1. Big Data Concept (1/5)

Big Data – datasets of such size and structure that they exceed the capabilities of traditional programming tools (databases, software, etc.) for structured, semi-structured and unstructured data collection, storage and processing in a reasonable time, and a fortiori exceed the capacity of their perception by a human. All of that makes it impossible to manage and process the data effectively in a traditional way.

7 “V”s:
1) volume – very large volumes of data
2) velocity – very high data transfer rate
3) variety – weakly structured data, which is primarily understood as data structure irregularity and the difficulty of extracting homogeneous data from a stream and identifying correlations
4-7) veracity, variability, value and visibility

1. Big Data Concept (2/5)

Types of big data processing:
1) batch processing in pseudo (soft) real-time, when data already stored in non-volatile memory are processed, and the probability and time characteristics of the data conversion process are determined mainly by the requirements of the applied problems
2) stream processing in hard real-time, when data held in RAM without saving to non-volatile storage media are processed, and the probability and time characteristics of the data conversion process are determined mainly by the incoming data rate, since the appearance of queues at the processing nodes leads to irreversible loss of data
3) a hybrid computational approach based on the Lambda Architecture with three layers: batch, serving and speed (see the sketch below).

Big data ~ a continuously flowing substance, whose processing and securing mechanisms must be built into the streams themselves.
Big data IT ~ data-centric (data-driven) IT, processing very large-scale arrays of semi-structured data in real time.
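
A minimal, hypothetical sketch of the hybrid (Lambda-style) approach listed above, in Python: every security event lands in a master dataset and updates a speed-layer view immediately, a batch layer periodically recomputes the view from the full dataset, and a serving-layer query merges both. The event fields and counters are illustrative assumptions, not part of the original slides.

from collections import Counter

# Hypothetical security events: dicts with "ts" (seconds) and "event_type".
master_dataset = []          # batch layer: store of all raw events
speed_view = Counter()       # speed layer: incremental counts of recent events
batch_view = Counter()       # serving-layer input: periodically precomputed counts

def ingest(event):
    """Every event lands in the master dataset and updates the speed layer."""
    master_dataset.append(event)
    speed_view[event["event_type"]] += 1

def run_batch(now):
    """Periodically recompute the batch view over the full master dataset."""
    global batch_view, speed_view
    batch_view = Counter(e["event_type"] for e in master_dataset if e["ts"] <= now)
    # reset the speed layer so it only covers events newer than the batch run
    speed_view = Counter(e["event_type"] for e in master_dataset if e["ts"] > now)

def query():
    """Serving layer: merge precomputed batch results with the real-time view."""
    return batch_view + speed_view

ingest({"ts": 1, "event_type": "login_failure"})
ingest({"ts": 2, "event_type": "port_scan"})
run_batch(now=2)
ingest({"ts": 3, "event_type": "login_failure"})
print(query())   # Counter({'login_failure': 2, 'port_scan': 1})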

1. Big Data Concept (3/5)

The theoretical basis of big data IT is data science – a section of computing that includes the following:
1) development of a methodology for distributed file systems and converting datasets to create procedures for parallel and distributed processing of very large data amounts
2) similarity search, including the key minhashing techniques (search for intersections in an array’s subsets) and locality-sensitive hashing (see the sketch below)
3) data-stream processing and specialized algorithms for fast-arriving data that must be processed immediately
4) search engine technology for large-scale datasets and ranking search results, link-spam detection, and the hubs-and-authorities approach
5) frequent-itemset data mining, including association rules, market baskets, the a-priori algorithm and its improvements
6) clustering algorithms for very large, high-dimensional datasets
7) Web application problems: managing advertising/recommendation systems
8) algorithms for analyzing and mining the structure of very large graphs, especially social-network graphs
9) techniques for obtaining the important properties of a large dataset by dimensionality reduction, including singular-value decomposition and latent semantic indexing
10) machine-learning algorithms that can be applied to very large-scale data, such as perceptrons, support-vector machines, and gradient descent.
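
To illustrate item 2, a toy MinHash sketch in Python for estimating the similarity between two sets, here assumed to be the sets of destination IP addresses contacted by two hosts; the hash scheme, signature length and data are illustrative assumptions only.

import hashlib

def minhash_signature(items, num_hashes=64):
    """Build a MinHash signature: for each of num_hashes seeded hash functions,
    keep the minimum hash value observed over the set's items."""
    sig = []
    for seed in range(num_hashes):
        min_val = min(
            int(hashlib.md5(f"{seed}:{item}".encode()).hexdigest(), 16)
            for item in items
        )
        sig.append(min_val)
    return sig

def estimated_jaccard(sig_a, sig_b):
    """The fraction of matching signature positions approximates Jaccard similarity."""
    matches = sum(1 for a, b in zip(sig_a, sig_b) if a == b)
    return matches / len(sig_a)

# Hypothetical example: sets of destination IPs contacted by two hosts
host_a = {"10.0.0.1", "10.0.0.2", "10.0.0.3", "203.0.113.9"}
host_b = {"10.0.0.1", "10.0.0.2", "10.0.0.3", "198.51.100.4"}

sig_a = minhash_signature(host_a)
sig_b = minhash_signature(host_b)
print(f"Estimated Jaccard similarity: {estimated_jaccard(sig_a, sig_b):.2f}")  # true value is 0.60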

1. Big Data Concept (4/5)

Some important characteristics of big IS-related data:
• Be accurate: data needs to be correct and come from a reliable (trusted) source
• Be timely: data must be current and reflect the up-to-date ITI IS level, IS threats, vulnerabilities, IS controls and solutions, and, if necessary, historic data should be added in due course
• Be comprehensive: data needs to be collected into a model that paints a full picture and is flexible, integrated and easily distilled into useful and meaningful information – to help protect an organization’s ITI assets immediately
• Be tailored: data should be tailored towards a specific ISM purpose in achieving ITI asset IS
• Be relevant: data must be applicable to and actual for the organization using it…
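
A minimal, hypothetical sketch of how the “accurate” and “timely” requirements above might be enforced before IS-related records enter analysis; the trusted-source list and the freshness window are invented for illustration.

import time

TRUSTED_SOURCES = {"ids-sensor-01", "proxy-cluster", "dc-controller"}  # assumed list
MAX_AGE_SECONDS = 15 * 60                                              # assumed freshness window

def accept_record(record, now=None):
    """Accept only records that come from a trusted source and are recent enough."""
    now = now or time.time()
    if record.get("source") not in TRUSTED_SOURCES:
        return False                     # not from a reliable (trusted) source
    if now - record.get("ts", 0) > MAX_AGE_SECONDS:
        return False                     # stale: does not reflect the current IS level
    return True

record = {"source": "proxy-cluster", "ts": time.time() - 60, "event": "blocked_url"}
print(accept_record(record))   # True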

1. Big Data Concept (5/5)

IS-related data mining tasks can be classified as:
1) descriptive, characterizing the general properties of the data in the database (DB)
2) predictive, performing inference on the current data in order to make predictions.

These tasks cover data characterization; association and correlation analysis; classification; prediction; cluster, outlier and evolution analysis – with interestingness measures and thresholds to filter out discovered patterns.

For security operators and analysts this means: IS events and incidents, vulnerabilities, actual network attacks, IS risks, trends and so on. Such knowledge should be evidence-based and inter alia include metadata, constraints or thresholds and concept hierarchies, used to organize attributes or their values into different levels of abstraction.
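
As an illustration of filtering discovered patterns with interestingness measures, a toy Python computation of support and confidence for one association rule over hypothetical IS incident records; the records, the rule and the thresholds are assumptions for illustration.

# Hypothetical incident records: each is a set of observed attributes
incidents = [
    {"phishing_email", "credential_theft"},
    {"phishing_email", "credential_theft", "lateral_movement"},
    {"port_scan", "lateral_movement"},
    {"phishing_email"},
]

def rule_metrics(antecedent, consequent, transactions):
    """Support and confidence for the rule: antecedent => consequent."""
    n = len(transactions)
    both = sum(1 for t in transactions if antecedent <= t and consequent <= t)
    ante = sum(1 for t in transactions if antecedent <= t)
    support = both / n
    confidence = both / ante if ante else 0.0
    return support, confidence

MIN_SUPPORT, MIN_CONFIDENCE = 0.3, 0.6     # interestingness thresholds (assumed)

s, c = rule_metrics({"phishing_email"}, {"credential_theft"}, incidents)
if s >= MIN_SUPPORT and c >= MIN_CONFIDENCE:
    print(f"Interesting rule: phishing_email => credential_theft (support={s:.2f}, confidence={c:.2f})")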

2. Data Lake Concept (1/5)

2010: the concept of «data lakes» or «data hubs» was introduced by James Dixon. Sometimes it is considered simply a marketing label for a product that supports Hadoop, or it is said that yesterday's unified storage is today's enterprise data lake.

A data lake: a massively scalable storage repository that holds a vast amount of raw data in its native format («as is») until it is needed, plus processing systems (engines) that can ingest data without compromising the data structure.

~ a large data pool bringing all of the historical data (collected and accumulated, about past events and circumstances pertaining to a particular subject) and new data (structured, unstructured, semi-structured + binary from sensors, devices and so on) in near-real time into one single place, in which the schema and data requirements are not defined until the data is queried («schema-on-read»).
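
A minimal sketch of the «schema-on-read» idea in Python: raw, heterogeneous records are stored untouched, and a schema (a field selection) is applied only when a query runs; the field names and records are illustrative assumptions.

import json

# Ingest: raw records are stored "as is", with no upfront schema
raw_store = [
    json.dumps({"ts": "2016-08-23T10:00:00", "src": "10.0.0.5", "action": "login", "ok": False}),
    json.dumps({"ts": "2016-08-23T10:00:02", "user": "alice", "url": "http://example.com"}),
]

def query(schema):
    """Schema-on-read: project each raw record onto the requested fields at query time,
    skipping records that do not carry all of them."""
    for raw in raw_store:
        record = json.loads(raw)
        if all(field in record for field in schema):
            yield {field: record[field] for field in schema}

# Only now do we decide which structure we care about
for row in query(schema=["ts", "src", "action"]):
    print(row)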

2. Data Lake Concept (2/5)

Other data lake features:
1. They are typically built to handle large and quickly arriving volumes of unstructured data (in contrast to data warehouses (DWH), which process highly structured data) from which further insights are derived => they use dynamic analytical applications (not pre-built static ones as in a DWH).
2. They use a flat architecture, where each data element has a unique identifier and a set of extended metadata tags (in contrast to a hierarchical DWH with file- or folder-based data storage).
3. The data in the lake becomes accessible as soon as it is created (in contrast to a DWH designed for slowly changing data).
4. They require maintaining the order of data arrival.
5. They often include a semantic DB, a conceptual model that leverages the same standards and technologies used to create Internet hyperlinks, and add a layer of context over the data that defines its meaning and its interrelationships with other data.
6. Data lake strategies can combine SQL and NoSQL approaches and online analytics and transaction processing (OLAP and OLTP) capabilities.

2. Data Lake Concept (3/5)

The data lake can be divided into three separate tiers:
1) raw data
2) augmented daily data sets
3) third-party information.

Another approach (according to data lifetime, see the sketch below):
1) data that is less than 6 months old
2) older but still active data
3) archived data that is no longer used but needs to be retained (this stale data can be moved to slower, less expensive media).

=> The data lake serves as a cost-effective place to conduct preliminary analysis of data, while flexible and task-oriented data structuring is implemented only where and for what it is necessary. The data lake outflow is the analyzed data. The data lake should be integrated with the rest of the enterprise’s ITI. This requires initial cataloguing and indexing of the data as well as data security.
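
A hypothetical sketch of the lifetime-based tiering approach above: records are routed to a tier by age and last access, with the 6-month boundary taken from the slide and the archival cut-off and tier names invented for illustration.

from datetime import datetime, timedelta

SIX_MONTHS = timedelta(days=182)
ARCHIVE_AFTER = timedelta(days=730)   # assumed: data untouched for ~2 years goes to archive

def tier_for(record_ts, last_access_ts, now=None):
    """Assign a data-lake tier from a record's age and last access time."""
    now = now or datetime.utcnow()
    if now - record_ts < SIX_MONTHS:
        return "hot"          # less than 6 months old
    if now - last_access_ts < ARCHIVE_AFTER:
        return "active"       # older but still in use
    return "archive"          # stale: move to slower, less expensive media

now = datetime(2016, 8, 23)
print(tier_for(datetime(2016, 7, 1), datetime(2016, 8, 1), now))   # hot
print(tier_for(datetime(2014, 1, 1), datetime(2013, 6, 1), now))   # archive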

2. Data Lake Concept (4/5)

A few very important characteristics should be supported for data in the lakes:
1) A scale-out architecture with high availability that grows with the data
2) Governance and enforcement of policies for retention and disposition, identification of data to be tiered
3) Centralized cataloging and indexing of the inventory of available data (and metadata), including sources, versioning, veracity and accuracy
4) Data cardinality – how the data relates to other data
5) Data transformation lineage (tracking) – what was done with it, when and where it came from (the evaluation of internal, external, and acquired third-party data sources), who changed it and why, what versions exist, how long it will be useful or relevant, etc. (see the sketch after this list)
6) A single, easy-to-manage and fully shareable data store accessible to all applications (instead of creating silos for new file, mobile, cloud and Hadoop workflows, and copies of data)
7) A shared-access model so that each bit of data is simultaneously accessible in multiple formats, to eliminate the extract, transform and load process and allow data-in-place analytics, accelerated workflow support between disparate applications, etc.
8) Access from any device (a tablet, smartphone, laptop, desktop…) to support a mobile workforce
9) Agile analytics into and from the data lake using multiple analytical approaches and data workflows, as well as single-subject analytics based on very specific use cases
10) Some level of quality of service that securely isolates consolidated workflows in their own zones within the system for safeguarding or performance
11) Efficiency, including erasure coding, compression, deduplication, etc.
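
To make points 3) and 5) more concrete, a toy Python catalog entry tracking source, version and transformation lineage for one data set; all identifiers and field names are assumptions for illustration.

from dataclasses import dataclass, field
from typing import List

@dataclass
class CatalogEntry:
    """Minimal inventory record for a data set kept in the lake."""
    dataset_id: str
    source: str                      # where the data came from
    version: int = 1
    lineage: List[str] = field(default_factory=list)   # what was done with it, when, by whom

    def record_transformation(self, description: str):
        """Track a transformation and bump the version."""
        self.version += 1
        self.lineage.append(f"v{self.version}: {description}")

entry = CatalogEntry(dataset_id="proxy-logs-2016-08", source="proxy-cluster")
entry.record_transformation("normalized timestamps to UTC by etl-job-17")
entry.record_transformation("dropped records with malformed URLs by analyst jdoe")
print(entry.version, entry.lineage)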

2. Data Lake Concept (5/5)

The IS-related data going into a lake contain logs and sensor data (e.g., from the Internet of Things), low-level customer behavior (e.g., website click streams), social media, document collections (e.g., e-mail and customer files), geo-location trails, images, video and audio, and other data useful for integrated analysis.

IS-related data lake governance includes an application framework to capture and contextualize IS-related data by cataloging and indexing, and further advanced metadata management. It helps to collaboratively create models (views) of this data and then gain more visibility and manage incremental improvements to the metadata.

Advanced IS-related metadata management combines working with rapidly changing data structures with sub-second query response on highly structured data.

For the data lake itself, as it is a single raw-data store, ensuring its operational availability, integrity, access control, authentication and authorization, monitoring and audit, business continuity and disaster recovery is of great importance.

3. Fast Data Concept (1/3)

In today’s connected and interactive dynamic world, the enterprises’ streams of high-velocity data from sensors, actuators and machine-to-machine communication in the Internet of Things (IoT) and modern networks are very large. Managing and extracting value from IoT data is a pressing challenge for enterprises, and it has become vital for them to identify what data is time-sensitive and should be acted upon right away and, vice versa, what data can sit in a DB or data lake until there is a reason to mine it.

Fast data corresponds to the application of big data analytics to smaller data sets in near-real or real time in order to solve a particular problem. Fast data plays an important role in applications that require low latency and depend upon high input/output capability for rapid updates.

The goal of fast data analytics is to quickly gather and mine structured and unstructured data from thousands to millions of devices so that action can be taken. Fast data often comes into data systems in streams, and the emphasis is on processing big data streams at speed.

3. Fast Data Concept (2/3)

Fast data requires two technologies:
1) A streaming system capable of delivering events as fast as they come in (the combination of in-memory DBs and data grids on top of flash devices will allow an increase in the capacity of stream processing)
2) A data store capable of processing each item as fast as it arrives (new flash drives are ready to break the current speed limit, which is bounded mostly by the performance of hard disk drives).

=> Fast data processing can be described as:
• ingest (get millions of events per second)
• decide (make a data-driven decision on each event)
• analyze in real time (to enable automated decision-making and provide visibility into operational trends of the events) – a minimal pipeline sketch follows below.

The Lambda Architecture defines a robust framework for ingesting streams of fast data which also provides efficient real-time and historical analytics. In this framework all the data flows in only one direction: into the system. The architecture’s main goal is to execute OLAP faster. Some fast data applications rely on rapid batch data, while others require real-time streams.
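
A minimal sketch of the ingest → decide → analyze loop for a stream of security events, in Python; the decision rule (block a source after five failed logins within a one-minute window) and the event fields are invented for illustration.

from collections import defaultdict, deque

WINDOW_SECONDS = 60
FAILURE_THRESHOLD = 5        # assumed rule: 5 failed logins per minute per source

recent_failures = defaultdict(deque)    # source_ip -> timestamps of recent failures
trend_counters = defaultdict(int)       # running aggregates for operational visibility

def ingest(event):
    """Process each event as it arrives: decide on it and update the analytics."""
    decision = decide(event)
    analyze(event, decision)
    return decision

def decide(event):
    """Per-event, data-driven decision with low latency."""
    if event["type"] != "login_failure":
        return "allow"
    q = recent_failures[event["src"]]
    q.append(event["ts"])
    while q and event["ts"] - q[0] > WINDOW_SECONDS:
        q.popleft()                      # drop failures outside the time window
    return "block" if len(q) >= FAILURE_THRESHOLD else "allow"

def analyze(event, decision):
    """Real-time aggregation that feeds dashboards / trend analysis."""
    trend_counters[(event["type"], decision)] += 1

for i in range(6):
    print(ingest({"ts": 100 + i, "src": "203.0.113.9", "type": "login_failure"}))
print(dict(trend_counters))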

3. Fast Data Concept (3/3)

Potential use cases for fast data:
o smart surveillance cameras that can continuously record events and use predictive analytics to identify and flag security anomalies as they occur
o smart grid applications that can analyze real-time electric power usage at tens of thousands of locations and automatically initiate load shedding to balance supply with demand in specific geographical areas.

Conclusions:
1) Fast data is a complementary approach to big data for managing large quantities of «in-flight» data. Interacting with fast data radically differs from interacting with big data at rest and requires systems that are architected differently.
2) Currently, the volume of IS-related information is one thing, but the real problem for securing ITI assets is the speed with which IS-related things happen. That is why all IS-related data can even be regarded as fast data, as it requires the corresponding security measures to be activated immediately.

Conclusion (1/2)

• Big data is data at rest, while fast data is data in motion.
• Sometimes fast data is big (when we should process high volumes of fast data). Consequently, these two concepts intersect.
• Comparing big data and data lakes, the conclusion is that the second concept evolutionarily continues the first one at a higher turn of the spiral.
• The data lake contains all of the enterprise’s data, including raw data and the intermediate and final results of its processing.
• The final picture of the interrelation of the three concepts:

Conclusion (2/2)

• This picture can be logically complemented by another one which shows the interrelation of the three concepts from the viewpoint of the modern enterprise’s data architecture supporting the operation of these concepts in the real world.

Natalia Miloslavskaya
[email protected]

BigR&I 2016, Vienna, 23 August 2016
