Providing Big Data Applications with Fault-Tolerant Data Migration Across Heterogeneous NoSQL Databases
Marco Scavuzzo, Damian A. Tamburri, Elisabetta Di Nitto (Politecnico di Milano, Italy)
BIGDSE '16, May 16th 2016, Austin
NoSQLs and Big Data applications
Highly available, big data applications need specific storage technologies:
- Distributed File Systems (DFSs), e.g., HDFS, Ceph
- NoSQL databases, e.g., Riak, Cassandra, MongoDB, Neo4j
NoSQLs are preferred to DFSs for:
- efficient data access (for reads and/or writes);
- concurrent data access;
- adjustable data consistency and integrity policies;
- logic (filter, group, aggregate) in the data layer instead of in the application layer (Hive, Pig, etc.).
NoSQL heterogeneity
- Lack of standard data access interfaces and languages
- Lack of common data models (e.g., data types, secondary indexes, integrity constraints)
- Different architectures, leading to different ways of approaching important problems (e.g., concurrency control, replication, transactions)
Vendor lock-in
"The lack of standards due to most NoSQLs creating their own APIs [..] is going to be a nightmare in due course of time w.r.t. porting applications from one NoSQL to another. Whether it is an open source system or a proprietary one, users will feel locked in." (C. Mohan)
Research objective
Provide a method and a supporting architecture to aid fault-tolerant data migration across heterogeneous NoSQL databases for Big Data applications.
Hegira4Cloud
Hegira4Cloud requirements
1. Big Data migration across any NoSQL database and Database as a Service (DaaS)
2. High-performance data migration
3. Fault-tolerant data migration
Hegira4Cloud approach
[Architecture diagram: Source DB → SRC → MIGRATION QUEUE → TWC → Target DB. SRC, the queue, and TWC form the Migration System Core: SRC converts source data to the metamodel format, TWC converts data from the metamodel format into the target's data model.]
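The metamodel is the pivot that keeps SRC and TWC independent of each other's databases: SRC only serializes into it, and TWC only deserializes from it. Below is a minimal sketch of what such an intermediate entity could look like; the class and field names (MetamodelEntity, Property, the indexed flag) are illustrative assumptions, not necessarily Hegira4Cloud's actual metamodel.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical intermediate representation of an entity being migrated.
public class MetamodelEntity {
    private final String key;                          // primary key in the source DB
    private final Map<String, Property> properties = new HashMap<>();

    public MetamodelEntity(String key) { this.key = key; }

    public void addProperty(String name, String type, byte[] value, boolean indexed) {
        properties.put(name, new Property(type, value, indexed));
    }

    public String getKey() { return key; }
    public Map<String, Property> getProperties() { return properties; }

    // Each property carries enough metadata (declared type, indexing flag)
    // for a TWC to map it onto the target database's data model.
    public static class Property {
        final String type;      // source-agnostic type name, e.g. "String", "Int64"
        final byte[] value;     // serialized value
        final boolean indexed;  // whether the target should build a secondary index

        Property(String type, byte[] value, boolean indexed) {
            this.type = type; this.value = value; this.indexed = indexed;
        }
    }
}
```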
Hegira4Cloud V1
[Architecture diagram: the same Source DB → SRC → MIGRATION QUEUE → TWC → Target DB pipeline, deployed as a single, monolithic Migration System Core]
Monolithic architecture: data migration from GAE Datastore to Azure Tables

                             dataset #1     dataset #2     dataset #3
Source size (MB)             16             64             512
# of Entities                36940          147758         1182062
Migration time (sec)         1098 (~18 m)   4270 (~71 m)   34111 (~568 m)
Entities throughput (ent/s)  33.643         34.604         34.653
Avg. %CPU usage              4.749          3.947          4.111
Improving performance: components decoupling
[Diagram: Source DB → SRC → MIGRATION QUEUE → TWC → Target DB, with SRC, the queue, and TWC as separately deployable components]
Decoupling the components helps in:
- distributing the computation (the conversions to/from the intermediate metamodel);
- isolating possible bottlenecks;
- finding (and solving) errors.
A sketch of the resulting component contracts follows below.
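One way to picture the decoupling: SRC and TWC share only the queue and the metamodel format, never each other's database APIs. The interfaces below are hypothetical contracts drawn for illustration, not Hegira4Cloud's actual class names.

```java
// Hypothetical component contracts. Because only the metamodel and the
// migration queue are shared, each side can be scaled, profiled, and
// debugged independently of the other.
interface Src {
    // Read entities from the source DB, convert them to the metamodel
    // format, and publish them on the migration queue.
    void extractAndPublish();
}

interface Twc {
    // Consume metamodel entities from the migration queue, convert them
    // to the target DB's data model, and write them.
    void consumeAndWrite();
}
```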
Improving performance: parallelization
[Diagram: multiple SRC instances → MIGRATION QUEUE → multiple TWC instances]
The operations to be executed can be parallelized:
- data extraction (from the source database), provided the data are partitionable;
- data loading (into the target database).
Improving performance: TWC parallelization
[Diagram: multiple TWC instances consuming from the MIGRATION QUEUE]
Challenges:
- avoid duplicating data (i.e., process disjoint data exactly once);
- avoid thread starvation;
- in case of a fault, already extracted data should not be lost.
Solution: RabbitMQ
- messages are distributed (disjointly) among consumers in round-robin fashion;
- correctly processed messages are acknowledged and removed;
- messages are persisted on disk.
A consumer sketch follows below.
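A minimal sketch of such a TWC consumer with the RabbitMQ Java client; the queue name "migration-queue" and the writeToTarget step are illustrative assumptions, not Hegira4Cloud's actual code.

```java
import com.rabbitmq.client.*;

public class TwcConsumer {
    public static void main(String[] args) throws Exception {
        ConnectionFactory factory = new ConnectionFactory();
        factory.setHost("localhost");                       // assumed broker host
        Channel ch = factory.newConnection().createChannel();

        // Durable queue: with persistent publishing on the SRC side,
        // messages survive a broker restart (persisted on disk).
        ch.queueDeclare("migration-queue", true, false, false, null);

        // Fair dispatch: at most one unacknowledged message per consumer,
        // so the broker spreads messages round-robin across TWC instances.
        ch.basicQos(1);

        DeliverCallback onDeliver = (consumerTag, delivery) -> {
            long tag = delivery.getEnvelope().getDeliveryTag();
            try {
                writeToTarget(delivery.getBody());          // hypothetical target write
            } catch (Exception e) {
                // Requeue so another TWC can retry: extracted data are not lost.
                ch.basicNack(tag, false, true);
                return;
            }
            // Ack only after a successful write; the broker then removes it.
            ch.basicAck(tag, false);
        };
        ch.basicConsume("migration-queue", false, onDeliver, consumerTag -> {});
    }

    private static void writeToTarget(byte[] metamodelEntity) {
        /* deserialize the metamodel entity and insert it into the target DB */
    }
}
```

Acknowledging only after the target write is what makes already extracted data survive a TWC crash: unacked messages are simply redelivered to another consumer.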
Improving performance: SRC parallelization
[Diagram: multiple SRC instances → MIGRATION QUEUE → multiple TWC instances]
Challenges:
- complete knowledge of the stored data is needed in order to partition it;
- partitions should be processed at most once (to avoid duplicates).
Improving performance: SRC parallelization
Let us assume that data are associated with a unique, incremental primary key (or an indexed property).
[Diagram: the source key space is split into Virtual Data Partitions, e.g., VDP1 = keys 1 … 10, VDP2 = keys 11 … 20, VDP3 = keys 21 … 30; each SRC instance extracts a different VDP]
References to the VDPs are stored inside a persistent storage.
A sketch of the partitioning step follows below.
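A minimal sketch of how the VDP key ranges could be derived under that assumption; the Vdp type, the plan method, and the fixed partition size are illustrative choices, not Hegira4Cloud's actual implementation.

```java
import java.util.ArrayList;
import java.util.List;

public class VdpPlanner {
    // A virtual data partition: a disjoint, contiguous range of primary keys.
    record Vdp(int id, long firstKey, long lastKey) {}

    // Split the key space [minKey, maxKey] into fixed-size VDPs.
    // Requires unique, incremental keys (or an indexed, monotonically
    // growing property), as assumed on the slide.
    static List<Vdp> plan(long minKey, long maxKey, long partitionSize) {
        List<Vdp> vdps = new ArrayList<>();
        int id = 1;
        for (long first = minKey; first <= maxKey; first += partitionSize) {
            long last = Math.min(first + partitionSize - 1, maxKey);
            vdps.add(new Vdp(id++, first, last));
        }
        return vdps;
    }

    public static void main(String[] args) {
        // Keys 1..30 with partition size 10 yield the three VDPs on the slide.
        plan(1, 30, 10).forEach(System.out::println);
    }
}
```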
Addressing faults
[Diagram: Source DB → SRC → MIGRATION QUEUE → TWC → Target DB]
Types of (non-trivial) faults:
- database faults;
- component faults;
- network faults (connection loss).
On a connection loss, not all databases guarantee a unique pointer to the data (e.g., Google Datastore).
Virtual data partitioning
[Diagram: the source key space (keys 1 … 30) is split into VDPs: VDP1 = keys 1 … 10, VDP2 = keys 11 … 20, VDP3 = keys 21 … 30. Each VDP moves through a partition-status state machine: not_mig --migrate--> under_mig --finish_mig--> migrated.]
The Status Log, kept in ZooKeeper, tracks the status of each VDP:

VDPid   Status
1       migrated
2       under_mig
3       not_mig

A sketch of such a status log follows below.
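A minimal sketch of a status log on top of the plain ZooKeeper Java client. The znode paths, the status strings, and the versioned compare-and-swap transition are assumptions drawn for illustration; Hegira4Cloud's actual coordination logic may differ.

```java
import org.apache.zookeeper.*;
import org.apache.zookeeper.data.Stat;

public class StatusLog {
    private final ZooKeeper zk;

    public StatusLog(String connectString) throws Exception {
        // 10s session timeout; a no-op watcher is enough for this sketch.
        zk = new ZooKeeper(connectString, 10_000, event -> {});
    }

    // One znode per VDP, e.g. /statuslog/vdp-1, holding its current status.
    // Assumes the /statuslog parent znode already exists.
    public void init(int vdpId) throws Exception {
        zk.create("/statuslog/vdp-" + vdpId, "not_mig".getBytes(),
                  ZooDefs.Ids.OPEN_ACL_UNSAFE, CreateMode.PERSISTENT);
    }

    // Atomically move a VDP from one status to the next. The versioned
    // setData acts as a compare-and-swap: if another SRC instance changed
    // the znode in the meantime, the write fails and the partition is not
    // taken twice (at-most-once processing).
    public boolean transition(int vdpId, String from, String to) throws Exception {
        String path = "/statuslog/vdp-" + vdpId;
        Stat stat = new Stat();
        String current = new String(zk.getData(path, false, stat));
        if (!current.equals(from)) return false;  // someone else owns this VDP
        try {
            zk.setData(path, to.getBytes(), stat.getVersion());
            return true;
        } catch (KeeperException.BadVersionException e) {
            return false;                         // lost the race; skip this VDP
        }
    }
}
```

Under this sketch, an SRC instance claims a partition with transition(id, "not_mig", "under_mig") and marks it done with transition(id, "under_mig", "migrated"); after a crash, VDPs left in under_mig can be detected in the log and re-extracted without touching the already migrated ones.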
Hegira4Cloud V2
[Architecture diagram: Source DB → SRC → MIGRATION QUEUE → TWC → Target DB, with the STATUS LOG alongside]
Hegira4Cloud V2: Evaluation

                             Monolithic architecture                Parallel distributed architecture*
                             dataset #1   dataset #2   dataset #3   dataset #1
Source size (MB)             16           64           512          318464 (311 GB)
# of Entities                36940        147758       1182062      ~107M
Migration time (sec)         1098         4270         34111        124867 (~34½ h)
Entities throughput (ent/s)  33.643       34.604       34.653       856.41
Avg. %CPU usage              4.749        3.947        4.111        49.87

* 1 source reading thread, 40 target writing threads
Conclusions
- An efficient, fault-tolerant method for data migration
- An architecture supporting data migration across NoSQL databases
- Several databases supported (Azure Tables, Cassandra, Google Datastore, HBase)
- Evaluated on an industrial case study

Future work
- Support online data migrations
- Rigorous tests for assessing data completeness and correctness
Marco Scavuzzo, PhD student at Politecnico di Milano
You can find me at: [email protected]

Credits: presentation template by SlidesCarnival