Providing Big Data Applications with Fault-Tolerant Data Migration Across Heterogeneous NoSQL Databases
Marco Scavuzzo, Damian A. Tamburri, Elisabetta Di Nitto (Politecnico di Milano, Italy)
BIGDSE '16, May 16th 2016, Austin
NoSQLs and Big Data applications
Highly available, big data applications need specific storage technologies:
- Distributed File Systems (DFSs), e.g., HDFS, Ceph
- NoSQL databases, e.g., Riak, Cassandra, MongoDB, Neo4j
NoSQLs are preferred to DFSs for:
- efficient data access (for reads and/or writes);
- concurrent data access;
- adjustable data consistency and integrity policies;
- logic (filter, group, aggregate) in the data layer instead of in the application layer (Hive, Pig, etc.).
NoSQL heterogeneity
- Lack of standard data access interfaces and languages
- Lack of common data models (e.g., data types, secondary indexes, integrity constraints)
- Different architectures, leading to different ways of approaching important problems (e.g., concurrency control, replication, transactions)
Vendor lock-in
"The lack of standards due to most NoSQLs creating their own APIs [..] is going to be a nightmare in due course of time w.r.t. porting applications from one NoSQL to another. Whether it is an open source system or a proprietary one, users will feel locked in." (C. Mohan)
Research objective
Provide a method and a supporting architecture to aid fault-tolerant data migration across heterogeneous NoSQL databases for Big Data applications.
Hegira4Cloud
Hegira4Cloud requirements
1. Big Data migration across any NoSQL database and Database as a Service (DaaS)
2. High-performance data migration
3. Fault-tolerant data migration
Hegira4Cloud approach
[Architecture diagram: Source DB → SRC → MIGRATION QUEUE → TWC → Target DB. SRC, the queue, and TWC form the Migration System Core: SRC converts source data to the metamodel format, TWC converts data from the metamodel format into the target's data model.]
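The metamodel is the pivot that keeps SRC and TWC independent of each other's databases: SRC only serializes into it, and TWC only deserializes from it. Below is a minimal sketch of what such an intermediate entity could look like; the class and field names (MetamodelEntity, Property, the indexed flag) are illustrative assumptions, not necessarily Hegira4Cloud's actual metamodel.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical intermediate representation of an entity being migrated.
public class MetamodelEntity {
    private final String key;                          // primary key in the source DB
    private final Map<String, Property> properties = new HashMap<>();

    public MetamodelEntity(String key) { this.key = key; }

    public void addProperty(String name, String type, byte[] value, boolean indexed) {
        properties.put(name, new Property(type, value, indexed));
    }

    public String getKey() { return key; }
    public Map<String, Property> getProperties() { return properties; }

    // Each property carries enough metadata (declared type, indexing flag)
    // for a TWC to map it onto the target database's data model.
    public static class Property {
        final String type;      // source-agnostic type name, e.g. "String", "Int64"
        final byte[] value;     // serialized value
        final boolean indexed;  // whether the target should build a secondary index

        Property(String type, byte[] value, boolean indexed) {
            this.type = type; this.value = value; this.indexed = indexed;
        }
    }
}
```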
Hegira4Cloud V1
[Architecture diagram: the same Source DB → SRC → MIGRATION QUEUE → TWC → Target DB pipeline, deployed as a single, monolithic Migration System Core]
Monolithic architecture: data migration from GAE Datastore to Azure Tables

                             dataset #1     dataset #2     dataset #3
Source size (MB)             16             64             512
# of Entities                36940          147758         1182062
Migration time (sec)         1098 (~18 m)   4270 (~71 m)   34111 (~568 m)
Entities throughput (ent/s)  33.643         34.604         34.653
Avg. %CPU usage              4.749          3.947          4.111
Improving performance: components decoupling
[Diagram: Source DB → SRC → MIGRATION QUEUE → TWC → Target DB, with SRC, the queue, and TWC as separately deployable components]
Decoupling the components helps in:
- distributing the computation (the conversions to/from the intermediate metamodel);
- isolating possible bottlenecks;
- finding (and solving) errors.
A sketch of the resulting component contracts follows below.
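One way to picture the decoupling: SRC and TWC share only the queue and the metamodel format, never each other's database APIs. The interfaces below are hypothetical contracts drawn for illustration, not Hegira4Cloud's actual class names.

```java
// Hypothetical component contracts. Because only the metamodel and the
// migration queue are shared, each side can be scaled, profiled, and
// debugged independently of the other.
interface Src {
    // Read entities from the source DB, convert them to the metamodel
    // format, and publish them on the migration queue.
    void extractAndPublish();
}

interface Twc {
    // Consume metamodel entities from the migration queue, convert them
    // to the target DB's data model, and write them.
    void consumeAndWrite();
}
```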
Improving performance: parallelization
[Diagram: multiple SRC instances → MIGRATION QUEUE → multiple TWC instances]
The operations to be executed can be parallelized:
- data extraction (from the source database), provided the data are partitionable;
- data loading (into the target database).
Improving performance: TWC parallelization
[Diagram: multiple TWC instances consuming from the MIGRATION QUEUE]
Challenges:
- avoid duplicating data (i.e., process disjoint data exactly once);
- avoid thread starvation;
- in case of a fault, already extracted data should not be lost.
Solution: RabbitMQ
- messages are distributed (disjointly) among consumers in round-robin fashion;
- correctly processed messages are acknowledged and removed;
- messages are persisted on disk.
A consumer sketch follows below.
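A minimal sketch of such a TWC consumer with the RabbitMQ Java client; the queue name "migration-queue" and the writeToTarget step are illustrative assumptions, not Hegira4Cloud's actual code.

```java
import com.rabbitmq.client.*;

public class TwcConsumer {
    public static void main(String[] args) throws Exception {
        ConnectionFactory factory = new ConnectionFactory();
        factory.setHost("localhost");                       // assumed broker host
        Channel ch = factory.newConnection().createChannel();

        // Durable queue: with persistent publishing on the SRC side,
        // messages survive a broker restart (persisted on disk).
        ch.queueDeclare("migration-queue", true, false, false, null);

        // Fair dispatch: at most one unacknowledged message per consumer,
        // so the broker spreads messages round-robin across TWC instances.
        ch.basicQos(1);

        DeliverCallback onDeliver = (consumerTag, delivery) -> {
            long tag = delivery.getEnvelope().getDeliveryTag();
            try {
                writeToTarget(delivery.getBody());          // hypothetical target write
            } catch (Exception e) {
                // Requeue so another TWC can retry: extracted data are not lost.
                ch.basicNack(tag, false, true);
                return;
            }
            // Ack only after a successful write; the broker then removes it.
            ch.basicAck(tag, false);
        };
        ch.basicConsume("migration-queue", false, onDeliver, consumerTag -> {});
    }

    private static void writeToTarget(byte[] metamodelEntity) {
        /* deserialize the metamodel entity and insert it into the target DB */
    }
}
```

Acknowledging only after the target write is what makes already extracted data survive a TWC crash: unacked messages are simply redelivered to another consumer.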
Improving performance: SRC parallelization
[Diagram: multiple SRC instances → MIGRATION QUEUE → multiple TWC instances]
Challenges:
- complete knowledge of the stored data is needed in order to partition it;
- partitions should be processed at most once (to avoid duplicates).
Improving performance: SRC parallelization
Let us assume that data are associated with a unique, incremental primary key (or an indexed property).
[Diagram: the source key space is split into Virtual Data Partitions, e.g., VDP1 = keys 1 … 10, VDP2 = keys 11 … 20, VDP3 = keys 21 … 30; each SRC instance extracts a different VDP]
References to the VDPs are stored inside a persistent storage.
A sketch of the partitioning step follows below.
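A minimal sketch of how the VDP key ranges could be derived under that assumption; the Vdp type, the plan method, and the fixed partition size are illustrative choices, not Hegira4Cloud's actual implementation.

```java
import java.util.ArrayList;
import java.util.List;

public class VdpPlanner {
    // A virtual data partition: a disjoint, contiguous range of primary keys.
    record Vdp(int id, long firstKey, long lastKey) {}

    // Split the key space [minKey, maxKey] into fixed-size VDPs.
    // Requires unique, incremental keys (or an indexed, monotonically
    // growing property), as assumed on the slide.
    static List<Vdp> plan(long minKey, long maxKey, long partitionSize) {
        List<Vdp> vdps = new ArrayList<>();
        int id = 1;
        for (long first = minKey; first <= maxKey; first += partitionSize) {
            long last = Math.min(first + partitionSize - 1, maxKey);
            vdps.add(new Vdp(id++, first, last));
        }
        return vdps;
    }

    public static void main(String[] args) {
        // Keys 1..30 with partition size 10 yield the three VDPs on the slide.
        plan(1, 30, 10).forEach(System.out::println);
    }
}
```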
Addressing faults
[Diagram: Source DB → SRC → MIGRATION QUEUE → TWC → Target DB]
Types of (non-trivial) faults:
- database faults;
- component faults;
- network faults (connection loss).
On a connection loss, not all databases guarantee a unique pointer to the data (e.g., Google Datastore).
Virtual data partitioning
[Diagram: the source key space (keys 1 … 30) is split into VDPs: VDP1 = keys 1 … 10, VDP2 = keys 11 … 20, VDP3 = keys 21 … 30. Each VDP moves through a partition-status state machine: not_mig --migrate--> under_mig --finish_mig--> migrated.]
The Status Log, kept in ZooKeeper, tracks the status of each VDP:

VDPid   Status
1       migrated
2       under_mig
3       not_mig

A sketch of such a status log follows below.
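A minimal sketch of a status log on top of the plain ZooKeeper Java client. The znode paths, the status strings, and the versioned compare-and-swap transition are assumptions drawn for illustration; Hegira4Cloud's actual coordination logic may differ.

```java
import org.apache.zookeeper.*;
import org.apache.zookeeper.data.Stat;

public class StatusLog {
    private final ZooKeeper zk;

    public StatusLog(String connectString) throws Exception {
        // 10s session timeout; a no-op watcher is enough for this sketch.
        zk = new ZooKeeper(connectString, 10_000, event -> {});
    }

    // One znode per VDP, e.g. /statuslog/vdp-1, holding its current status.
    // Assumes the /statuslog parent znode already exists.
    public void init(int vdpId) throws Exception {
        zk.create("/statuslog/vdp-" + vdpId, "not_mig".getBytes(),
                  ZooDefs.Ids.OPEN_ACL_UNSAFE, CreateMode.PERSISTENT);
    }

    // Atomically move a VDP from one status to the next. The versioned
    // setData acts as a compare-and-swap: if another SRC instance changed
    // the znode in the meantime, the write fails and the partition is not
    // taken twice (at-most-once processing).
    public boolean transition(int vdpId, String from, String to) throws Exception {
        String path = "/statuslog/vdp-" + vdpId;
        Stat stat = new Stat();
        String current = new String(zk.getData(path, false, stat));
        if (!current.equals(from)) return false;  // someone else owns this VDP
        try {
            zk.setData(path, to.getBytes(), stat.getVersion());
            return true;
        } catch (KeeperException.BadVersionException e) {
            return false;                         // lost the race; skip this VDP
        }
    }
}
```

Under this sketch, an SRC instance claims a partition with transition(id, "not_mig", "under_mig") and marks it done with transition(id, "under_mig", "migrated"); after a crash, VDPs left in under_mig can be detected in the log and re-extracted without touching the already migrated ones.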
Hegira4Cloud V2
[Architecture diagram: Source DB → SRC → MIGRATION QUEUE → TWC → Target DB, with the STATUS LOG alongside]
Hegira4Cloud V2: Evaluation

                             Monolithic architecture                Parallel distributed architecture*
                             dataset #1   dataset #2   dataset #3   dataset #1
Source size (MB)             16           64           512          318464 (311 GB)
# of Entities                36940        147758       1182062      ~107M
Migration time (sec)         1098         4270         34111        124867 (~34½ h)
Entities throughput (ent/s)  33.643       34.604       34.653       856.41
Avg. %CPU usage              4.749        3.947        4.111        49.87

* 1 source reading thread, 40 target writing threads
Conclusions
- An efficient, fault-tolerant method for data migration
- An architecture supporting data migration across NoSQL databases
- Several databases supported (Azure Tables, Cassandra, Google Datastore, HBase)
- Evaluated on an industrial case study

Future work
- Support online data migrations
- Rigorous tests for assessing data completeness and correctness
Marco Scavuzzo, PhD student at Politecnico di Milano
You can find me at: [email protected]

Credits: presentation template by SlidesCarnival