DATA MODELING AND INDEXING FOR APACHE ACCUMULO - Sqrrl

Securely explore your data

DATA MODELING AND INDEXING FOR APACHE ACCUMULO Sqrrl Webinar Series October, 2013 Adam Fuchs, CTO Sqrrl Data, Inc.

RECAP
In our September Webinar: Sqrrl, Apache Accumulo, and Cell-Level Security

1. Introduction to Sqrrl and Accumulo
2. Security In The Wild
3. Sqrrl and Accumulo Technology
4. The Data-Centric Security Ecosystem

Sqrrl Data, Inc. Confidential and Proprietary

TODAY'S DISCUSSION
Data Modeling and Indexing for Apache Accumulo

1. Sqrrl and Accumulo Technology Review
2. Table Designs
   1. Dynamic Documents
   2. Graphs
   3. Inverted Indexes
3. Putting It All Together with Sqrrl

LAYERED ARCHITECTURE
Turtles all the way down...

Application
    Sqrrl API over Apache Thrift RPC (JSON, Graph, Aggregation, Search, etc.)
Sqrrl Enterprise
    Accumulo RPC (Sorted Key/Value I/O)
Accumulo
    Hadoop RPC (File I/O)
Hadoop

ACCUMULO DATA FORMAT
An Accumulo key is a 5-tuple, consisting of:

• Row: Controls Atomicity
• Column Family: Controls Locality
• Column Qualifier: Controls Uniqueness
• Visibility Label: Controls Access
• Timestamp: Controls Versioning
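
As a rough illustration of the five key components listed above, the sketch below models a key in plain Python (not the Accumulo client API) and sorts entries the way Accumulo orders keys: ascending by row, column family, column qualifier, and visibility, then newest timestamp first. The `Key` tuple, the `sort_key` helper, and the sample cells are hypothetical and exist only for this example.

```python
from collections import namedtuple

# Hypothetical stand-in for an Accumulo key; this is not the real client class.
Key = namedtuple("Key", "row family qualifier visibility timestamp")

def sort_key(key):
    """Row, family, qualifier, visibility ascending; newest timestamp first."""
    return (key.row, key.family, key.qualifier, key.visibility, -key.timestamp)

# Made-up cells: two versions of one cell in rowA, plus one cell in rowB.
entries = {
    Key("rowB", "cf", "cq", "public", 5): "b",
    Key("rowA", "cf", "cq", "public", 10): "old",
    Key("rowA", "cf", "cq", "public", 20): "new",
}

ordered = sorted(entries, key=sort_key)
for key in ordered:
    print(key, "->", entries[key])
```

Because the two rowA keys differ only in timestamp, the newer version sorts first, which is why a reader that keeps only the first entry per cell sees the latest value.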

Row       Col. Fam.     Col. Qual.     Visibility   Timestamp  Value
John Doe  Notes         PCP            PCP_JD       20120912   Patient suffers from an acute …
John Doe  Test Results  Cholesterol    JD|PCP_JD    20120912   183
John Doe  Test Results  Mental Health  JD|PSYCH_JD  20120801   Pass
John Doe  Test Results  X-Ray          JD|PHYS_JD   20120513   1010110110100…

Accumulo Key/Value Example
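
The Visibility column in the example above can be read as "any one of these labels grants access". The sketch below is a deliberately simplified reading of that idea in plain Python: it handles only `|`-separated labels like the ones in the example, whereas real Accumulo visibility expressions also allow `&` and parentheses, which are not handled here. The function names `can_read` and `visible_qualifiers` are hypothetical.

```python
def can_read(visibility, authorizations):
    """True if any '|'-separated label is among the reader's authorizations."""
    if not visibility:
        return True  # an empty label is treated as readable by everyone
    return any(label in authorizations for label in visibility.split("|"))

# (column qualifier, visibility) pairs taken from the example table.
cells = [
    ("PCP", "PCP_JD"),
    ("Cholesterol", "JD|PCP_JD"),
    ("Mental Health", "JD|PSYCH_JD"),
    ("X-Ray", "JD|PHYS_JD"),
]

def visible_qualifiers(authorizations):
    """Qualifiers of the example cells a reader with these authorizations may see."""
    return [qualifier for qualifier, vis in cells if can_read(vis, authorizations)]

print(visible_qualifiers({"JD"}))      # the patient
print(visible_qualifiers({"PCP_JD"}))  # the primary care physician
```

Under this reading, the patient sees every test result but not the physician's notes, while the primary care physician sees the notes and the cholesterol result only.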

THE ACCUMULO CLIENT API

Instance
    new ZooKeeperInstance(...)
    new MockInstance()
    getConnector(...) -> Connector

Connector
    TableOperations, InstanceOperations, SecurityOperations
    createScanner(...) -> Scanner
    createBatchScanner(...) -> BatchScanner
    createBatchWriter(...) -> BatchWriter

Scanner, BatchScanner (configured with Range and IteratorOption)
    iterator() -> Map.Entry of Key and Value

BatchWriter
    addMutation(...) <- Mutation

ACCUMULO TECHNOLOGY

Strengths
• Shared-Nothing => Scalability
• Micro-Batching for Efficient Random I/O
• High Concurrency, Low Latency for Denormalized Data
• Sparse, Flexible Schema supports dynamic and diverse data models
• Cell-level Security promotes sharing

Weaknesses
• Sorting induces write multiplication factor
• Sparse schema support induces additional storage overhead

Figure: Tablet Data Flow. Writes go to the In-Memory Map and to the Write Ahead Log (for recovery). Minor Compaction writes the In-Memory Map out as a Sorted, Indexed File; Merging / Major Compaction combines Sorted, Indexed Files. Scans (reads), minor compactions, and major compactions each pass through an Iterator Tree.

Figure: Cluster architecture. ZooKeeper delegates authority to the Master and the Tablet Servers; the Master assigns and balances Tablets across Tablet Servers; Tablet Servers store and replicate Tablet data in HDFS; Applications read from and write to Tablet Servers.
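
The tablet data flow (in-memory map, minor compaction to sorted files, merging reads, major compaction) can be sketched as a toy model in plain Python. This is an illustration of the general idea only, not Accumulo's implementation: the write-ahead log and iterator trees are left out, and the `Tablet` class, its method names, and the `flush_threshold` parameter are hypothetical.

```python
import heapq

class Tablet:
    """Toy model: an in-memory map flushed to sorted files and merged on read."""

    def __init__(self, flush_threshold=2):
        self.memory = {}   # in-memory map: key -> (sequence number, value)
        self.files = []    # each "file" is a sorted list of (key, seq, value)
        self.flush_threshold = flush_threshold
        self.seq = 0       # write sequence number; higher means newer

    def write(self, key, value):
        self.seq += 1
        self.memory[key] = (self.seq, value)
        if len(self.memory) >= self.flush_threshold:
            self.minor_compact()

    def _sorted_memory(self):
        return sorted((k, s, v) for k, (s, v) in self.memory.items())

    def minor_compact(self):
        """Flush the in-memory map to a new sorted file."""
        if self.memory:
            self.files.append(self._sorted_memory())
            self.memory = {}

    def _merged(self):
        """Merge all sources in key order, keeping the newest entry per key."""
        latest = {}
        for key, seq, value in heapq.merge(*self.files, self._sorted_memory()):
            latest[key] = (seq, value)  # equal keys arrive oldest first, so the last wins
        return [(k, s, v) for k, (s, v) in latest.items()]

    def scan(self):
        return [(k, v) for k, _, v in self._merged()]

    def major_compact(self):
        """Rewrite everything as a single sorted file."""
        merged = self._merged()
        self.files = [merged] if merged else []
        self.memory = {}

tablet = Tablet(flush_threshold=2)
for key, value in [("b", "1"), ("a", "2"), ("b", "3"), ("c", "4")]:
    tablet.write(key, value)
print(len(tablet.files), tablet.scan())
tablet.major_compact()
print(len(tablet.files), tablet.scan())
```

Each value is written once to memory, again when flushed, and again whenever files are merged, which is one way to picture the write multiplication listed under Weaknesses.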

TODAY'S DISCUSSION
Data Modeling and Indexing for Apache Accumulo

1. Sqrrl and Accumulo Technology Review
2. Table Designs
   1. Dynamic Documents
   2. Graphs
   3. Inverted Indexes
3. Putting It All Together with Sqrrl

PROXY/NETFLOW EXAMPLE

Source    Destination   Port   Bytes In     Bytes Out    Protocol
10.1.2.3  google.com    80     73,824       15,632       http
10.1.2.4  facebook.com  443    10,328       13,284,129   https
10.1.2.4  google.com    80     623,249      93,125       http
10.1.2.3  abcd1234.ru   31337  158          523,698,104  unknown
10.1.2.3  netflix.com   443    434,855,357  1,392,994    https
10.1.2.4  google.com    443    23,084       583,331      https
10.1.2.3  10.1.2.5      22     204          158          ssh

INDEXES AND QFDS

Input: Logs/Observations
• Immutable
• Append-Only

Transformation

Indexes and Question-Focused Datasets (QFDs)
• Real-Time
• Online
• Sorted
• Grouped
• Aggregated

QFD KEY GENERATION

Source    Destination  Port  Bytes In  Bytes Out  Protocol
10.1.2.3  google.com   80    73,824    15,632     http

Key                       -> Value
10.1.2.3, Bytes In        -> +73,824
10.1.2.3, Bytes Out       -> +15,632
10.1.2.3, Ports Used      -> +{80}
10.1.2.3, Protocols Used  -> +{http}

Hosts QFD (key space 0x00 ... 0xFF)
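
The key generation step on this slide can be sketched in a few lines of Python. The function below turns one flow record into the four (key, contribution) pairs shown for source 10.1.2.3. The record field names and the `qfd_contributions` function are hypothetical; only the key names and values come from the slide.

```python
def qfd_contributions(record):
    """Turn one flow record into (key, contribution) pairs for the Hosts QFD."""
    host = record["source"]
    return [
        ((host, "Bytes In"), record["bytes_in"]),
        ((host, "Bytes Out"), record["bytes_out"]),
        ((host, "Ports Used"), {record["port"]}),
        ((host, "Protocols Used"), {record["protocol"]}),
    ]

# The first row of the proxy/netflow example (field names are assumed).
record = {
    "source": "10.1.2.3",
    "destination": "google.com",
    "port": 80,
    "bytes_in": 73824,
    "bytes_out": 15632,
    "protocol": "http",
}

contributions = qfd_contributions(record)
for key, delta in contributions:
    print(key, "-> +", delta)
```

Numeric contributions are meant to be added to the stored value and set contributions to be unioned with it, so the same record can be applied in any order relative to other records.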

HOSTS QFD WITH AGGREGATION

IP        Ports Used            Protos Used                  Total Bytes In  Total Bytes Out  Ports Hosted  Protos Hosted
10.1.2.3  {22, 80, 443, 31337}  {http, https, ssh, unknown}  434,931,543     525,106,888      -             -
10.1.2.4  {80, 443}             {http, https}                656,661         13,960,585       -             -
10.1.2.5  -                     -                            158             204              {22}          {ssh}

New Contribution: (10.1.2.5, Total Bytes In -> +3,215)
10.1.2.5 Total Bytes In: 158 + 3,215 = 3,373
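
A minimal sketch of the aggregation on this slide, assuming numeric contributions are summed and set contributions are unioned. In Accumulo this kind of merge is typically done server-side by a combiner iterator; the sketch just uses a Python dict, and the names `combine` and `apply_contributions` are hypothetical. It replays the three 10.1.2.4 flows from the proxy/netflow example and then the new contribution to 10.1.2.5.

```python
def combine(current, delta):
    """Merge one contribution into the stored value: union sets, sum numbers."""
    if isinstance(delta, set):
        return (current or set()) | delta
    return (current or 0) + delta

def apply_contributions(table, contributions):
    """Fold (key, contribution) pairs into the table."""
    for key, delta in contributions:
        table[key] = combine(table.get(key), delta)
    return table

# (port, bytes in, bytes out, protocol) for the three flows sourced at 10.1.2.4.
flows_10_1_2_4 = [
    (443, 10328, 13284129, "https"),
    (80, 623249, 93125, "http"),
    (443, 23084, 583331, "https"),
]

hosts = {}
for port, bytes_in, bytes_out, protocol in flows_10_1_2_4:
    apply_contributions(hosts, [
        (("10.1.2.4", "Total Bytes In"), bytes_in),
        (("10.1.2.4", "Total Bytes Out"), bytes_out),
        (("10.1.2.4", "Ports Used"), {port}),
        (("10.1.2.4", "Protos Used"), {protocol}),
    ])

# Existing value for 10.1.2.5, followed by the new contribution from the slide.
hosts[("10.1.2.5", "Total Bytes In")] = 158
apply_contributions(hosts, [(("10.1.2.5", "Total Bytes In"), 3215)])

for key in sorted(hosts):
    print(key, "->", hosts[key])
```

Because both operations are associative and commutative, contributions can be merged in any order and in partial batches without changing the final row.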

CONNECTIVITY GRAPH

Graph nodes: 10.1.2.3, 10.1.2.4, 10.1.2.5, google.com, facebook.com, abcd1234.ru, netflix.com

Row       Col. Fam.  Col. Qual.    Val.
10.1.2.3  Contacts   10.1.2.5      -
10.1.2.3  Contacts   abcd1234.ru   -
10.1.2.3  Contacts   google.com    -
10.1.2.3  Contacts   netflix.com   -
10.1.2.4  Contacts   facebook.com  -
10.1.2.4  Contacts   google.com    -

Row           Col. Fam.  Col. Qual.  Val.
10.1.2.5      Serves     10.1.2.3    -
abcd1234.ru   Serves     10.1.2.3    -
facebook.com  Serves     10.1.2.4    -
google.com    Serves     10.1.2.3    -
google.com    Serves     10.1.2.4    -
netflix.com   Serves     10.1.2.3    -
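
The two edge tables on this slide can be derived from the proxy/netflow records by writing each connection twice, once under the source ("Contacts") and once under the destination ("Serves"). The sketch below does that in plain Python and looks up neighbors with a range scan over the sorted keys. It stands in for a sorted key/value table rather than using the Accumulo API, and the names `edge_keys` and `neighbors` are hypothetical.

```python
import bisect

# (source, destination) pairs from the proxy/netflow example.
flows = [
    ("10.1.2.3", "google.com"),
    ("10.1.2.4", "facebook.com"),
    ("10.1.2.4", "google.com"),
    ("10.1.2.3", "abcd1234.ru"),
    ("10.1.2.3", "netflix.com"),
    ("10.1.2.4", "google.com"),
    ("10.1.2.3", "10.1.2.5"),
]

def edge_keys(source, destination):
    """One (row, column family, column qualifier) key per direction."""
    return [(source, "Contacts", destination), (destination, "Serves", source)]

# A set drops the repeated 10.1.2.4 -> google.com edge; sorting mimics the table order.
edges = sorted({key for s, d in flows for key in edge_keys(s, d)})

def neighbors(row, family):
    """Range scan: every qualifier stored under (row, family)."""
    start = bisect.bisect_left(edges, (row, family, ""))
    found = []
    for r, f, qualifier in edges[start:]:
        if (r, f) != (row, family):
            break
        found.append(qualifier)
    return found

print(neighbors("10.1.2.3", "Contacts"))
print(neighbors("google.com", "Serves"))
```

Storing both directions costs twice the writes but lets either endpoint's neighbors be read with one contiguous scan.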

INVERTED INDEXING

Table:             Forward Index    Inverted Index
Row:
Column Family:
Column Qualifier:
Value:
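
The cell contents of this slide's table did not survive, so the following is only a sketch of one common forward/inverted layout, stated as an assumption rather than as the slide's design: the forward index is keyed by document ID, field, and term, and the inverted index is keyed by term, field, and document ID. The sample documents and the `lookup` function are hypothetical.

```python
import bisect

# Made-up documents keyed by a made-up document ID.
docs = {
    "doc1": {"destination": "google.com", "protocol": "http"},
    "doc2": {"destination": "facebook.com", "protocol": "https"},
    "doc3": {"destination": "google.com", "protocol": "https"},
}

# Assumed forward layout: (row=doc ID, family=field, qualifier=term).
forward = sorted(
    (doc_id, field, term)
    for doc_id, fields in docs.items()
    for field, term in fields.items()
)

# Assumed inverted layout: (row=term, family=field, qualifier=doc ID).
inverted = sorted((term, field, doc_id) for doc_id, field, term in forward)

def lookup(term, field):
    """Range scan of the inverted index for every doc ID under (term, field)."""
    start = bisect.bisect_left(inverted, (term, field, ""))
    hits = []
    for t, f, doc_id in inverted[start:]:
        if (t, f) != (term, field):
            break
        hits.append(doc_id)
    return hits

print(lookup("google.com", "destination"))
print(lookup("https", "protocol"))
```

With the term in the row, all documents matching a term sit next to each other in sort order, so a lookup becomes a single short scan instead of a pass over every document.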

INVERTED INDEXING


ADVANCED INDEXING

Table:                      Shard Table
Row:
Column Family:              "Docs", "Inv. Index", "Field Index", "Geo"
Column Qualifier (Tuples):
Value:
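
The row, qualifier, and value cells of the shard table are missing from this slide, so the sketch below fills them with assumptions in the spirit of document-partitioned indexing: the row is a shard ID derived from the document ID, the "Docs" family holds the document itself, and the "Inv. Index" family holds (term, doc ID) tuples in the qualifier. The shard count, the hashing choice, the sample documents, and all function names are hypothetical, and the "Field Index" and "Geo" families are not modeled.

```python
import zlib

NUM_SHARDS = 4  # assumed shard count for this sketch

def shard_for(doc_id):
    """Assumed row: a shard ID derived from a stable hash of the document ID."""
    return "shard%02d" % (zlib.crc32(doc_id.encode("utf-8")) % NUM_SHARDS)

def shard_entries(doc_id, text):
    """Document and its index entries, all written under the same shard row."""
    row = shard_for(doc_id)
    entries = [((row, "Docs", (doc_id,)), text)]
    for term in sorted(set(text.lower().split())):
        entries.append(((row, "Inv. Index", (term, doc_id)), ""))
    return entries

# Made-up documents for illustration.
table = {}
for doc_id, text in [("doc1", "acute chest pain"), ("doc2", "chest x-ray normal")]:
    table.update(shard_entries(doc_id, text))

def search(term):
    """Check each shard's index entries and fetch matching docs from the same row."""
    hits = []
    for (row, family, qualifier) in sorted(table):
        if family == "Inv. Index" and qualifier[0] == term:
            hits.append(table[(row, "Docs", (qualifier[1],))])
    return sorted(hits)

print(search("chest"))
print(search("pain"))
```

Keeping a document and its index entries in one row means a query can be answered shard by shard, with no lookups that cross rows.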

TODAY'S DISCUSSION
Data Modeling and Indexing for Apache Accumulo

1. Sqrrl and Accumulo Technology Review
2. Table Designs
   1. Dynamic Documents
   2. Graphs
   3. Inverted Indexes
3. Putting It All Together with Sqrrl

SQRRL ENTERPRISE
Simple API for Advanced Accumulo Usage

• Dynamic Documents
    • JSON I/O support
    • Cell-level Security and Efficient Aggregation Extensions
• Dynamic Graphs
    • Co-partitioned with Documents for Integrated Search and Discovery
• Search
    • Lucene Query Syntax
    • Accumulo Indexes Preserve Security Model
• Processing
    • SQL-Like Language for Transforming and Aggregating Results
    • Parallel Slicing and Extraction

REAL-TIME OPERATIONAL APPS
Contact us for a demo

HOW TO LEARN MORE

Download our White Paper
• www.sqrrl.com/whitepaper

Watch a video
• www.sqrrl.com/downloads#videos

Request a demo or one-on-one workshop
• www.sqrrl.com/contact

Come meet us
• Accumulo Meetup (October 28, New York)
• Strata + Hadoop World (October 28-30, New York)
• IBM IOD (November 4-7, Las Vegas)
• SC13 (November 18-21, Denver)

THANK YOU
Thanks for attending! To keep up to date with Sqrrl, check out our social media sites:
www.twitter.com/sqrrl_inc
www.linkedin.com/company/sqrrl