From Oracle to MongoDB - NoSQL Matters BCN 2012

From Oracle to MongoDB A real use case at Telefónica PDI

Pablo Enfedaque [email protected] 06.10.2012

Content

01 Introduction
•  Telefónica PDI. Who?
•  Personalisation Server. Why? What?

02 The SQL version
•  Data model and architecture
•  Integrations, problems and improvements

03 The NoSQL version
•  Data model and architecture
•  Performance boost
•  The bad

04 Conclusions
•  Conclusions
•  Personal thoughts

01

Introduction

01

Telefónica PDI. Who?

•  Telefónica
§  Fifth largest telecommunications company in the world
§  Operations in Europe (7 countries), the United States and Latin America (15 countries)

•  Telefónica Digital
§  Web and mobile digital contents and services division

•  Product Development and Innovation unit
§  Formerly Telefónica R&D
§  Product & service development, platforms development, research, technology strategy, user experience and deployment & operation
§  Around 70 different ongoing projects at any time

Telefónica PDI


01

Personalisation Server. What?

•  User profiling system
•  Machine learning
•  Recommendations
•  Customer’s profile storage

01

Opt-in and profile module. Why?

•  Users' data (profile and permissions) was scattered across different storages

IPTV service
•  Gender
•  Film and music preferences

Mobile service
•  Permission to contact by SMS?
•  Gender

Music tickets service
•  Address
•  Music preferences

Location based offers
•  Address
•  Permission to contact by SMS?

So you want to know my address… AGAIN?!


01

Opt-in and profile module. Why?

•  Provide a module to become the master storage for customers' data

Services: IPTV service, Mobile service, Music tickets service, Location based offers

Single profile stored once:
•  Gender
•  Film and music preferences
•  Permission to contact by SMS?
•  Address

01

Opt-in and profile module. What?

•  Features:
§  Flexible profile definition, classified in services
§  Profile sharing options between different services
§  Real time API
§  Supplementary offline batch interface
§  Authorization system
§  High availability
§  Inexpensive solution & hardware

02

The SQL solution

02

Data model: Services, users and their profile

•  Services defined a set of attributes (their profile), with default value and data type
•  Users were registered in services
•  Users defined values for some of the services' attributes
•  Each attribute value had an update date to avoid overwriting newer changes through batch loads

02

Data model: Services profile sharing matrix

•  Services could access attributes declared inside other services
•  There were sharing rights for read or read and write
•  The user had to be registered in both services
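The three sharing rules above can be sketched as a small access check. This is only an illustration under assumptions: the service names, attribute names, the `SHARING` and `REGISTRATIONS` structures and the `"r"` / `"rw"` mode strings are hypothetical and do not come from the original system.

```python
# Hypothetical sharing matrix: (owner service, attribute, consumer service) -> mode
SHARING = {
    ("iptv", "gender", "mobile"): "r",
    ("iptv", "music_prefs", "tickets"): "rw",
}

# Hypothetical registrations: user -> services the user is registered in
REGISTRATIONS = {
    "user1": {"iptv", "mobile"},
    "user2": {"tickets"},
}


def can_access(user, consumer, owner, attrib, write=False):
    """Return True if `consumer` may read (or write) `attrib` owned by `owner`."""
    registered = REGISTRATIONS.get(user, set())
    # The user had to be registered in both services.
    if consumer not in registered or owner not in registered:
        return False
    # A service always has full access to its own attributes.
    if consumer == owner:
        return True
    mode = SHARING.get((owner, attrib, consumer))
    if mode is None:
        return False
    # "rw" grants both operations, "r" grants reads only.
    return mode == "rw" or not write
```

For instance, `can_access("user1", "mobile", "iptv", "gender")` is allowed, while the same call with `write=True` is refused because the right is read-only.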

02

Data model: Authorization system

•  Everything that could be accessed in the PS was a resource
•  Roles defined access rights (read or read and write) of resources
•  Auth users had roles
•  Roles could include other roles
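Because roles could include other roles, the effective rights of a role have to be collected recursively. A minimal sketch of that idea follows; the role names, the `ROLES` layout and the `"r"` / `"rw"` strings are assumptions made for illustration, not the original schema.

```python
# Hypothetical role definitions: own rights plus the roles each one includes
ROLES = {
    "viewer": {"rights": {"stats": "r"}, "includes": []},
    "operator": {"rights": {"batch": "rw"}, "includes": ["viewer"]},
    "admin": {"rights": {"stats": "rw"}, "includes": ["operator"]},
}


def effective_rights(role, roles, seen=None):
    """Return {resource: mode} for `role`, following included roles."""
    seen = set() if seen is None else seen
    if role in seen:  # guard against circular inclusion
        return {}
    seen.add(role)
    definition = roles[role]
    rights = dict(definition.get("rights", {}))
    for included in definition.get("includes", []):
        for resource, mode in effective_rights(included, roles, seen).items():
            # Keep the stronger right: never downgrade "rw" to "r".
            if rights.get(resource) != "rw":
                rights[resource] = mode
    return rights
```

In a relational model the same expansion needs a join per level of inclusion, which is one reason the authorization check cost several joins.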

02

Data model: Bonus features!

•  Multiple IDs:
§  A user's profile could be accessed with different equivalent IDs depending on the service
§  Each user ID was defined by an ID type (phone number, email, portal ID, hash…) and the ID value

02

High level logical architecture

§  Everything running on Red Hat EL 5.4 64 bits

02

Integration: Planned integration

•  PS replaces all customer profile and permissions DBs
•  All systems access this data through the PS real time API
•  In special cases, some PS consumers could use the batch interface
•  In the same way, new services could be added quite easily

02

Integration: Problems arise

•  Budget restrictions: adapting all services to use the API was too expensive
•  Keep the independent systems' DBs and synchronize the PS through batch
•  Use the DBs' built-in massive extraction feature to generate daily batch files
•  However… in most cases those DBs were not able to generate Delta (only changes) extractions
§  Provide full daily snapshots!

02

First version performance: Ireland

•  1.8M customers, 180 profile attributes, 6 services
•  Sizes
§  Tables + indexes size: 65GB
§  30% of the size were indexes
•  Batch
§  Full DWH customer’s profile import: > 24 hours
§  Delta extractions: 4 - 6 hours
§  Loads and extractions performance proportional to data size
•  API:
§  Response time with average traffic: 110ms

03

The SQL solution: second version

03

Second version: High level logical architecture

•  New approach: batch processes access the DB directly

03

Second version: Batch processes

•  Batch processes had to
§  Validate authentication and authorization
§  Verify user, service and attribute existence
§  Check equivalent IDs
§  Validate sharing matrix rights
§  Validate values data type
§  Check the update date of the existing values

03

Second version: DB batch processing

[Image captioned “Our DBAs”]

03

Second version: New DB-based batch loading process

•  Preprocess incoming batch file in BE servers
§  Validate format, services and attributes existence and values data types
§  Generate intermediate file with a structure like the target DB table
•  Load intermediate file (Oracle’s SQL*Loader) to a temporary table
•  Switch DB to “deferred writing”, storing all incoming modifications
•  Merge temporary table and final table, checking the values' update date
•  Replace old users attributes values table with merge result
•  Apply deferred writing operations
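The merge step, where a loaded value only wins if its update date is newer than the stored one, can be sketched in a few lines. This is an in-memory illustration, not the PL/SQL that was actually used; the `(user_id, attrib_id)` key layout and the function name are assumptions.

```python
from datetime import datetime


def merge_values(final, incoming):
    """Merge incoming batch values into the final table.

    Both arguments map (user_id, attrib_id) -> (value, update_date).
    An incoming value replaces the stored one only if it is newer.
    """
    merged = dict(final)
    for key, (value, updated) in incoming.items():
        current = merged.get(key)
        if current is None or updated > current[1]:
            merged[key] = (value, updated)
    return merged


final_table = {
    ("u1", 10140): ("Open", datetime(2011, 9, 27)),
    ("u1", 10141): ("yes", datetime(2011, 10, 8)),
}
batch_table = {
    ("u1", 10140): ("Closed", datetime(2011, 10, 1)),  # newer: applied
    ("u1", 10141): ("no", datetime(2011, 10, 1)),      # older: ignored
    ("u2", 10140): ("Open", datetime(2011, 10, 1)),    # new row: added
}
result = merge_values(final_table, batch_table)
```

The rule protects changes made through the real time API from being overwritten by a batch file that was extracted earlier.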

03

Second version: New batch extraction process

•  Generate a temporary DB table with a format similar to the final batch file. Two loops over the users attributes values table required:
§  Select format of the table; number and order of columns / attributes
§  Fill the new table
•  Loop over the whole temporary table for final formatting (empty fields…)
•  From the batch side, loop across the whole table (SELECT * FROM …)
•  Write each retrieved row as a line in the resulting file

03

Second version performance: Ireland performance requirements

•  Batch time window: 3:30 hours
§  Full DWH load
§  Two Delta loads
§  Three Delta extractions
•  API:
§  Ireland requirement: < 500ms

03

Second version performance: Ireland

•  1.8M customers, 180 profile attributes, 6 services
•  Sizes
§  Tables + indexes size: 65GB
§  30% of the size were indexes
§  Temporary tables size increases almost exponentially: 15GB and above
§  Intermediate file size: from 700MB to 7GB
•  Batch
§  Full DWH customer’s profile import: 2:30 hours
§  Delta extractions: 1:00 hour
§  Loads performance worsened quickly (almost exponentially): 6:00 hours
§  Extractions performance proportional to data size
§  Concurrent batch processes may halt the DB
•  API:
§  Response time with average traffic: 80ms
§  Response time while loading was unpredictable: >300ms

04

The SQL solution: third version

04

Third version: Speed up DB batch processes

[Image captioned “Our DBAs (again)”]

04

Third version: New (second) DB-based batch loading process

•  Minor preprocessing of incoming batch file in BE servers
§  Just validate format
§  No intermediate file needed!
•  Load validated file (Oracle’s SQL*Loader) to a temporary table
•  Loop over the temporary table merging the values into the final table, checking the values' update date and data types
§  Use several concurrent writing jobs
•  Store results on the real table, no need to replace!
•  No “deferred writing”!

04

Third version: Enhancements to extraction process

•  Optimized loops to generate the temporary output table
§  Use several concurrent writing jobs
§  We achieved a speed-up of between 1.5x and 2x
•  Loop over the whole temporary table for final formatting (empty fields…)
•  Download and write lines directly inside Oracle’s sqlplus
•  No SELECT * FROM … query from the batch side!

04

Third version performance: Ireland

•  1.8M customers, 180 profile attributes, 6 services
•  Sizes
§  Tables + indexes size: 65GB
§  30% of the size were indexes
§  Temporary tables: 15GB
•  Batch
§  Full DWH customer’s profile import: 1:10 hours (vs. 2:30 hours)
§  Three Delta extractions: 2:15 hours (vs. 3:00 hours)
§  Loads and extractions performance proportional to data size
§  Concurrent batch processes not so harmful
•  API:
§  Response time with average traffic: 110ms
§  Response time while loading: 400ms

[Image captioned “Our DBAs”: F**K YEAH]

04

Third version performance: United Kingdom

•  25M customers, 150 profile attributes, 15 services
•  Sizes
§  Tables + indexes size: 700GB
§  40% of the size were indexes
•  Batch
§  Two Delta imports: < 2:00 hours
§  Two Delta extractions: < 2:00 hours
§  Loads and extractions performance proportional to data size
•  API:
§  Response time with average traffic: 90ms

[Image captioned “Our DBAs”: F**K YEAH]

04

Third version performance

Ireland                 3rd version           2nd version
DB size                 65GB + 15GB (temp)    65GB + > 15GB
Full DWH load           1:10 hours            2:30 hours
Three Delta exports     2:15 hours            3:00 hours
Batch stability         Stable, linear        Unstable, exponential
API response time       110ms                 110ms
API while loading       400ms                 Unpredictable

United Kingdom          3rd version
DB size                 700GB
Two Delta loads         < 2:00 hours
Three Delta exports     < 2:00 hours
API response time       90ms

[Image captioned “Our DBAs”: F**K YEAH]

04

Third version performance: DB stats

•  20 database tables
•  API: several queries with up to 35 joins and even some unions
•  Authorization: 5 joins to validate auth users' access
•  Batch:
§  Load: 1700 lines of PL/SQL
§  Extraction: 1200 lines of PL/SQL

04

Mission completed?


04

Third version performance: Mexico

•  20M customers, 200 profile attributes, 10 services
•  Mexico time window: 4:00 hours
§  Full DWH load!
§  Additional Delta feeds loads
§  At least two Delta extractions

[Image captioned “Our DBAs”]

05

The NoSQL solution

05

MongoDB Data Model: Services and their profile + sharing matrix

    {
      _id : 7,
      service_name : "root",
      id_type : 1,
      default_values : false,
      owned_attribs : [
        {
          attrib_id : 70005,
          attrib_name : "marketing.consent",
          attrib_data_type : 1,
          attrib_def_value : "no",
          attrib_status : 1
        },
        ...
      ],
      shared_attribs : [
        { attrib_id : 20144, sharing_mode : 0 },
        ...
      ]
    }

Note: attrib_id = service_id * 10000 + num attribs + 1
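The attrib_id rule noted on the slide can be written down directly. The sketch below also derives the owning service from an attrib_id; that reverse step is an inference, valid only under the assumption that a service never reaches 10000 attributes. The function names are hypothetical.

```python
ATTRIBS_PER_SERVICE = 10000  # multiplier taken from the slide's formula


def next_attrib_id(service_id, num_attribs):
    """attrib_id = service_id * 10000 + num attribs + 1"""
    return service_id * ATTRIBS_PER_SERVICE + num_attribs + 1


def owner_service(attrib_id):
    """Recover the owning service (assumes < 10000 attributes per service)."""
    return attrib_id // ATTRIBS_PER_SERVICE


# Service 7 with 4 existing attributes gets 70005 for the next one.
new_id = next_attrib_id(7, 4)
```

Under that assumption the shared attribute 20144 in the example document belongs to service 2, so the owner can be found without any extra lookup.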

05

MongoDB Data Model: Users and their profile + multiple IDs

User document:

    {
      _id : "011234",
      services_list : [
        {
          service_id : 1,
          reg_date : {"$date" : 1318040693000}
        },
        ...
      ],
      user_values : [
        {
          attrib_id : 10140,
          attrib_value : "Open",
          update_date : {"$date" : 1317110161000}
        },
        ...
      ]
    }

Equivalent ID document:

    {
      _id : "05abcd",
      ue : "011234"
    }

Notes: _id = "id type" + "user ID"; attrib_id = service_id * 10000 + num attribs + 1
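A lookup through an equivalent ID takes at most two reads: one for the requested `_id` and, if that document only carries a `ue` reference, one more for the main document. The sketch below simulates the collection with a plain dict so it runs without a database; the two-digit zero-padded ID type is an assumption read off the example values, and the function names are hypothetical.

```python
def make_id(id_type, user_id):
    """_id = "id type" + "user ID" (two-digit type prefix is an assumption)."""
    return f"{id_type:02d}{user_id}"


def find_user(users, id_type, user_id):
    """Return the main user document, following an equivalent ID if needed."""
    doc = users.get(make_id(id_type, user_id))
    if doc is not None and "ue" in doc:
        # Equivalent ID document: "ue" holds the _id of the main document.
        doc = users.get(doc["ue"])
    return doc


# Stand-in for the users collection, keyed by _id
users = {
    "011234": {"_id": "011234", "services_list": [], "user_values": []},
    "05abcd": {"_id": "05abcd", "ue": "011234"},
}
```

Both `find_user(users, 1, "1234")` and `find_user(users, 5, "abcd")` return the same main document.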

05

MongoDB Data Model: Authorization system

AUTH USERS COLLECTION:

    {
      _id: "admin",
      auth_pswd: "XXX",
      auth_roles: ['PS_ADMIN_ROLE', …],
      auth_uris: [
        {uri_path: "/**", method: 'R'},
        {uri_path: "/stats/**", method: 'RW'},
        {uri_path: "/kpis/**", method: 'IMPORT'},
        ...
      ]
    }

Replicate uris (from resources) and methods (from roles)

ROLES COLLECTION:

    {
      _id: 'PS_ADMIN_ROLE',
      roles_resources: [
        { resource_id: "admin.**", method: 'R' },
        { resource_id: "stats.**", method: 'IMPORT' },
        ...
      ]
    }

RESOURCES COLLECTION:

    { _id: "admin.**", role_uri: "/**" }
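Since URIs and methods are replicated into each auth user document, an access check only needs that one document. The sketch below shows how such a check might look. The matching rules are assumptions, not taken from the slides: a trailing `/**` is treated as "this path and everything below it", and `'RW'` is treated as granting both `'R'` and `'W'`.

```python
def uri_matches(pattern, path):
    """Match a path against a pattern, where a trailing '/**' covers a subtree."""
    if pattern.endswith("/**"):
        prefix = pattern[:-3]
        return path == prefix or path.startswith(prefix + "/")
    return pattern == path


def is_allowed(auth_user, path, method):
    """Check the replicated auth_uris of a single auth user document."""
    for entry in auth_user["auth_uris"]:
        granted = entry["method"]
        method_ok = granted == method or (granted == "RW" and method in ("R", "W"))
        if method_ok and uri_matches(entry["uri_path"], path):
            return True
    return False


admin = {
    "_id": "admin",
    "auth_uris": [
        {"uri_path": "/**", "method": "R"},
        {"uri_path": "/stats/**", "method": "RW"},
        {"uri_path": "/kpis/**", "method": "IMPORT"},
    ],
}
```

The trade-off is the usual one for denormalized data: reads touch one document, but a change to a role or resource has to be propagated to every auth user that holds it.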

05

MongoDB Data Model: DB stats

•  Only 5 collections
•  API: typically 2 accesses (services and users collections)
•  Authorization: access only 1 collection to grant access
•  Batch: all processing done outside the DB

05

NoSQL version: High level logical architecture

§  Everything running on Red Hat EL 6.2 64 bits

05

NoSQL version performance: Ireland (at PDI lab)

•  1.8M customers, 180 profile attributes, 6 services
•  Sizes
§  Collections + indexes size: 20GB (vs. 65GB)
§  < 5% of the size are indexes (vs. 30%)
•  Batch
§  Full DWH customer’s profile import: 0:12 hours (vs. 1:10 hours)
§  Three Delta extractions: 0:40 hours (vs. 2:15 hours)
§  Loads and extractions performance proportional to data size
§  Concurrent batch processes without performance impact
•  API:
§  Response time with average traffic: < 10ms (vs. 110ms)
§  Response time while loading: the same
§  High load (600 TPS) response time while loading: 300ms
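To put the Ireland batch figures in proportion, the durations quoted on the slide can be turned into speed-up factors. The helper names below are hypothetical; the inputs are the h:mm values from the slide.

```python
def minutes(hhmm):
    """Convert an 'h:mm' duration such as '1:10' into minutes."""
    hours, mins = hhmm.split(":")
    return int(hours) * 60 + int(mins)


def speedup(sql_time, nosql_time):
    """How many times faster the NoSQL run was than the SQL run."""
    return minutes(sql_time) / minutes(nosql_time)


full_load = speedup("1:10", "0:12")      # 70 min vs 12 min
delta_exports = speedup("2:15", "0:40")  # 135 min vs 40 min
```

That works out to roughly 5.8 times faster for the full DWH load and about 3.4 times faster for the three Delta extractions.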

05

NoSQL version performance: United Kingdom (at PDI lab)

•  25M customers, 150 profile attributes, 15 services
•  Sizes
§  Collections + indexes size: 210GB (vs. 700GB)
§  < 5% of the size were indexes
•  Batch
§  Two Delta imports: < 0:40 hours (vs. 2:00 hours)
§  Loads and extractions performance proportional to data size

05

NoSQL version performance: Mexico

•  20M customers, 200 profile attributes, 15 services
•  Sizes
§  Collections + indexes size: 320GB
§  Indexes size: 1.2GB
•  Batch
§  Initial Full import (20M, 40 attributes): 2:00 hours
§  Small Full import (20M, 6 attributes): 0:40 hours
•  API:
§  Response time with average traffic: < 10ms (vs. 90ms)
§  Response time while loading: the same
§  High load (500 TPS) response time while loading: 270ms

04

NoSQL version performance

Ireland                       NoSQL version    SQL version
DB size                       20GB             80GB
Full DWH load                 0:12 hours       1:10 hours
Three Delta exports           0:40 hours       2:15 hours
API while loading             < 10ms           400ms
API 600TPS + loading          300ms            Timeout / failure

United Kingdom                NoSQL version    SQL version
DB size                       210GB            700GB
Two Delta loads               < 0:40 hours     < 2:00 hours

Mexico                        NoSQL version
DB size                       320GB
Initial Full load (40 attr)   2:00 hours
Daily Full load (6 attr)      0:40 hours
API while loading             < 10ms
API 500TPS + loading          270ms

[Image captioned “Our DBAs”]

05

Mission completed?


05

The bad

•  Batch load process was too fast
§  To keep secondary nodes in sync we needed an oplog of 16 or 24GB
§  We had to disable journaling for the first migrations
•  Document field names take up disk space
§  Reduced them to just 2 chars: “attribute_id” -> “ai”
•  Respect the unwritten law of at least 70% of size in RAM
•  Take care with compound indexes, order matters
§  You can save one index… or you can have problems
§  Put the most important key (never nullable) first
•  DBAs whining and complaining about NoSQL
§  “If we had enough RAM for all data, Oracle would outperform MongoDB”
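One way to keep readable names in application code while storing two-character field names is a translation table applied just before writing. The sketch below is an illustration only: apart from “attribute_id” -> “ai”, which comes from the slide, every entry in `SHORT_NAMES` is a hypothetical example.

```python
# Long name -> stored name. Only the first entry is from the slide.
SHORT_NAMES = {
    "attribute_id": "ai",
    "attribute_value": "av",  # hypothetical
    "update_date": "ud",      # hypothetical
    "user_values": "uv",      # hypothetical
}


def shorten(value, names=SHORT_NAMES):
    """Recursively rename dict keys using `names`; unknown keys are kept."""
    if isinstance(value, dict):
        return {names.get(key, key): shorten(item, names) for key, item in value.items()}
    if isinstance(value, list):
        return [shorten(item, names) for item in value]
    return value


doc = {
    "_id": "011234",
    "user_values": [{"attribute_id": 10140, "attribute_value": "Open"}],
}
stored = shorten(doc)
```

Because every document repeats its field names, the saving is multiplied by the number of values stored per user and by the number of users.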

05

The ugly

•  Second migration once the PS is already running
§  Full import adding 30 new attributes values: 10:00 hours
§  Full import adding 150 new attributes values: 40:00 hours
•  Considerably increasing document size (e.g. adding lots of new values to the users) makes MongoDB relocate the documents, which performs around 5 times slower
§  That’s a problem when you are updating 10k documents per second
•  Solutions?
§  Avoid this situation at all cost. Run away!
§  Normalize users values; move them to a new individual collection
§  Preallocate the size with a faux field (you could waste space!)
§  Load in a new collection, merge and swap, like we did in Oracle
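The "faux field" option can be sketched as follows: insert each document with a throwaway padding field sized for the expected growth, then remove the field so later updates can grow into the freed space. This assumes a storage engine that keeps the record's allocated size once the field is removed; the helper names, the `_pad` field name and the 4096-byte figure are hypothetical. `$unset` is MongoDB's standard update operator for removing a field.

```python
def with_padding(doc, growth_bytes, pad_field="_pad"):
    """Return a copy of `doc` carrying a throwaway field of `growth_bytes` characters."""
    padded = dict(doc)
    padded[pad_field] = "x" * growth_bytes
    return padded


def remove_padding_update(pad_field="_pad"):
    """Update document that drops the padding field again."""
    return {"$unset": {pad_field: ""}}


user = {"_id": "011234", "user_values": []}
to_insert = with_padding(user, 4096)   # insert this document first
cleanup = remove_padding_update()      # then apply this update to the same _id
```

As the slide warns, any padding that is never used is simply wasted disk and RAM, so the growth estimate matters.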

06

Conclusions

06

Conclusions & personal thoughts

•  Awesome performance boost
§  But not all use cases fit in a MongoDB / NoSQL solution!
•  New technology, different limitations
•  Fear of the unknown
§  SSDs performance?
§  Long term performance and stability?
•  Python + MongoDB + pymongo = fast development
§  I mean, really fast
•  MongoDB Monitoring Service (MMS)
•  10gen people were very helpful

06

Questions?


0X

SQL Physical architecture

§  Scale horizontally by adding more BE or DB servers or disks in the SAN
§  Virtualized or physical servers depending on the deployment

0X

MongoDB Physical architecture

§  MongoDB arbiters running on BE servers
§  Scale horizontally by adding more BE servers or disks in the SAN
§  Sharding may already be configured to scale by adding more replica sets