SynopSys: Foundations for Multidimensional Graph Analytics

Michael Rudolf (1), Hannes Voigt (1), Christof Bornhoevd (2), and Wolfgang Lehner (1)

(1) Database Technology Group, Technische Universität Dresden
(2) SAP Labs, LLC, Palo Alto

Business Intelligence for the Real-Time Enterprise (BIRTE 2014)

September 1, 2014

Motivation: Big (Graph) Data

- 645 M users, 135 K new every day
- 58 M tweets & 2.1 G searches / day

Peak Performance
- Nov. 26, 2012: 26.5 M items (306/sec)
- Nov. 23, 2013: 36.8 M items (426/sec)

Intensional vs. Extensional

Intensional
- Schema & integrity constraints
- Created at design time by domain experts

Extensional
- Collect lots of data first
- Try to deduce the intension
- The Fourth Paradigm [Mic09]

[Figure: timeline (axis: Time) with the labels "ETL", "..." and "Once & forever"]

© Michael Rudolf | SynopSys: Foundations for Multidimensional Graph Analytics

The Property Graph Model

[Figure: example property graph. Vertex labels include Freddy FR, Steve US, Karl DE, Mike US; Apple iPhone 4, Apple iPhone 5, Apple iPad MC707LL/A; Phones, Tablets, Consumer Electronics; 4/5 stars, 5/5 stars; ordered 24/02/14, delivered 24/02/14. Edge labels: authors, rates, likes, in, part of, contains, records. Further attribute values shown: black, white, 16 GB, 32 GB, 64 GB.]

- Provides directed, attributed multi-relational graphs
- Attributes on vertices and edges as key-value pairs (instance-level instead of class-level)
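The property graph model described on this slide can be sketched in a few lines of Python. This is a minimal illustration, not the data structure used by the authors: the class names, the vertex identifiers, and the choice of which customer authored which review are assumptions made for the example.

```python
from dataclasses import dataclass, field
from typing import Any, Dict, List


@dataclass
class Vertex:
    id: str
    # Attributes are per-instance key-value pairs, not fixed by a class schema.
    attrs: Dict[str, Any] = field(default_factory=dict)


@dataclass
class Edge:
    source: str  # directed: source -> target
    target: str
    attrs: Dict[str, Any] = field(default_factory=dict)


@dataclass
class PropertyGraph:
    vertices: Dict[str, Vertex] = field(default_factory=dict)
    edges: List[Edge] = field(default_factory=list)

    def add_vertex(self, vid: str, **attrs: Any) -> None:
        self.vertices[vid] = Vertex(vid, dict(attrs))

    def add_edge(self, source: str, target: str, **attrs: Any) -> None:
        # Multi-relational: several edges may connect the same pair of vertices.
        self.edges.append(Edge(source, target, dict(attrs)))

    def out_edges(self, vid: str, **match: Any) -> List[Edge]:
        """Outgoing edges of vid whose attributes equal all given key-value pairs."""
        return [
            e for e in self.edges
            if e.source == vid and all(e.attrs.get(k) == v for k, v in match.items())
        ]


# Hypothetical fragment loosely based on the labels in the figure.
g = PropertyGraph()
g.add_vertex("c1", type="customer", name="Karl", nationality="DE")
g.add_vertex("r1", type="review", stars=4)
g.add_vertex("p1", type="product", name="Apple iPhone 4", color="black")
g.add_edge("c1", "r1", type="authors")
g.add_edge("r1", "p1", type="rates")
g.add_edge("c1", "p1", type="likes")
```

Storing the edge type as an ordinary attribute mirrors the `@type` predicates used in the path expressions later in the talk.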

Agenda

Analytical Scenario: From Graphs to Cubes

Operations: Roll-up, Drill-down, Slice & Dice

Challenges: Unbalanced Hierarchies & OLAP Anomalies


Graph Cube

[Figure: the example property graph from "The Property Graph Model", annotated step by step]

1. Identify facts
2. Specify dimensions
3. Define measures

Facts

Depending on the use case, a (base) fact can be
- a vertex attribute, an edge attribute, or
- the presence of an edge.

In general: a subgraph
⇒ Use pattern matching
⇒ Graphical specification instead of DSL

Example
Match reviews of products and their authors (vertex types indicated via color).

[Figure: fact pattern with edges labeled "authors" and "rates"]
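To make the idea of facts as pattern matches concrete, here is a small sketch that finds every match of the example pattern. The talk specifies patterns graphically, so the hard-coded matcher below is only an illustration; the edge directions (customer authors review, review rates product), the `type` values, and all identifiers are assumptions.

```python
# Hypothetical data: vertex id -> attributes, and (source, label, target) edges.
vertices = {
    "c1": {"type": "customer", "name": "Karl"},
    "c2": {"type": "customer", "name": "Steve"},
    "r1": {"type": "review", "stars": 4},
    "r2": {"type": "review", "stars": 5},
    "p1": {"type": "product", "name": "Apple iPhone 4"},
}
edges = [
    ("c1", "authors", "r1"),
    ("r1", "rates", "p1"),
    ("c2", "authors", "r2"),
    ("r2", "rates", "p1"),
    ("c2", "likes", "p1"),  # not part of the pattern, must be ignored
]


def match_reviews(vertices, edges):
    """Return one binding per match of $c -authors-> $r -rates-> $p."""
    facts = []
    for c, label1, r in edges:
        if label1 != "authors":
            continue
        if vertices[c]["type"] != "customer" or vertices[r]["type"] != "review":
            continue
        for r2, label2, p in edges:
            if r2 == r and label2 == "rates" and vertices[p]["type"] == "product":
                facts.append({"$c": c, "$r": r, "$p": p})
    return facts


facts = match_reviews(vertices, edges)
```

Each binding is one base fact; the aliases `$c`, `$r` and `$p` follow the naming used in the dimension and measure examples of the talk.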

Dimensions

Dimensions can be
1. vertex or edge attributes
2. connectivity

[Figure: the example property graph]

Structure in Dimensions
- extrinsic: not contained in graph data, needs to be provided externally (e.g., GeoNames)
- intrinsic: embodied in graph data
  - explicit: captured as topological information
  - implicit: has to be derived from attribute values

Intrinsic Dimensions

Explicit Dimensions
- Can be specified using path expressions
- In general requires one path expression per level, e.g.
  -[@type='belongsTo']->[@type='state']
  -[@type='partOf']->[@type='country']

Implicit Dimensions
- Might require bucketization
- In general requires one expression per level, e.g. GetWeekOfYear(@ordered) and GetYear(@ordered)

Expression syntax
- alias@attribute: access of vertex or edge attribute
- -[edge predicate]->[vertex predicate](length): paths (with optional recursion depth), optionally satisfying the predicates
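Bucketization for an implicit dimension can be sketched as one function per level. The snippet below maps the two expressions from the slide, GetWeekOfYear and GetYear, onto Python's `datetime`; using the ISO 8601 week number is an assumption, since the slide does not define which week numbering is meant, and the year 2014 for the date "24/02/14" is likewise assumed.

```python
from datetime import date


def get_week_of_year(d: date) -> int:
    # Assumption: ISO 8601 week numbering (weeks start on Monday).
    return d.isocalendar()[1]


def get_year(d: date) -> int:
    return d.year


# The "ordered" attribute value shown in the example graph, read as day/month/year.
ordered = date(2014, 2, 24)

# One derived value per level of the implicit time dimension.
time_levels = {
    "week": get_week_of_year(ordered),
    "year": get_year(ordered),
}
```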

Dimension Specification Example

Name         Seed Pattern  Levels
Nationality  $c            $c@nationality
Category     $p            Product category: $p-[@type='in']->
                           Product group:    $p-[@type='in']->-[@type='part-of']->
                           Product area:     $p-[@type='in']->-[@type='part-of']->(2)

Seed Pattern
- Connects facts to dimensions
- Is matched against facts
⇒ Has to be a super pattern of the fact pattern (i.e., more general)
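The three levels of the Category dimension can be evaluated with a simple path-following routine. This is a sketch under assumptions: the trailing `(2)` is read as "follow the part-of edge exactly twice", the path expressions are pre-parsed by hand into (edge type, hops) steps, and the small hierarchy is hypothetical.

```python
# Hypothetical (source, edge type, target) edges.
edges = [
    ("nexus5", "in", "Smartphones"),
    ("Smartphones", "part-of", "Phones"),
    ("Phones", "part-of", "Cell Phones & Accessories"),
]


def follow(edges, start, steps):
    """Follow each (edge_type, hops) step in order; return the reached vertices."""
    frontier = {start}
    for edge_type, hops in steps:
        for _ in range(hops):
            frontier = {
                target
                for (source, label, target) in edges
                if source in frontier and label == edge_type
            }
    return frontier


# Hand-translated level expressions of the Category dimension (seed pattern $p).
category_levels = {
    "Product category": [("in", 1)],
    "Product group": [("in", 1), ("part-of", 1)],
    "Product area": [("in", 1), ("part-of", 2)],
}

resolved = {
    level: follow(edges, "nexus5", steps)
    for level, steps in category_levels.items()
}
```

Because every level starts again from the seed vertex `$p`, the specification is relative to the fact, a point the talk returns to under "Unbalanced Hierarchies".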

Properties of Dimensions

Monotonicity
Levels should be ordered such that the number of items decreases.

Level  Name       # Elements
1      Region     125
2      Country    30
3      Continent  3

Hierarchy
Levels should form hierarchies. If two facts map to the same element in l_i, they should map to the same element in l_{i+1} as well.
⇒ Functional dependency

Fact  Level 1  Level 2  Level 3
A     Saxony   Germany  Europe
B     Saxony   Germany  Europe
C     Bavaria  Germany  Europe
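Both properties can be checked mechanically on a fact-to-level mapping. The sketch below uses the three facts from the table; the function names are illustrative, and treating "decreases" as "does not increase" is an assumption (the slide does not say whether equal counts are allowed).

```python
# Fact -> (level 1, level 2, level 3), taken from the table above.
fact_levels = {
    "A": ("Saxony", "Germany", "Europe"),
    "B": ("Saxony", "Germany", "Europe"),
    "C": ("Bavaria", "Germany", "Europe"),
}


def is_monotone(fact_levels):
    """True if the number of distinct elements never increases from level to level."""
    counts = [len(set(column)) for column in zip(*fact_levels.values())]
    return all(lower >= upper for lower, upper in zip(counts, counts[1:]))


def is_hierarchy(fact_levels):
    """True if each element of l_i determines exactly one element of l_{i+1}."""
    n_levels = len(next(iter(fact_levels.values())))
    for i in range(n_levels - 1):
        parent = {}
        for levels in fact_levels.values():
            # setdefault returns the parent seen first; a mismatch breaks the dependency.
            if parent.setdefault(levels[i], levels[i + 1]) != levels[i + 1]:
                return False
    return True


# A hypothetical fact that violates the functional dependency Region -> Country.
broken = dict(fact_levels, D=("Saxony", "France", "Europe"))
```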

Measures

A measure is a derived fact
- combining several facts
- computed by a specified function (e.g., scalar, aggregation).

⇒ Annotate the fact pattern
⇒ Introduce representative vertex

Example
- Average product rating by product category
- Minimum age of customers by nationality

[Figure: fact pattern with vertices $c, $r, $p and edges $a (authors) and $e (rates), annotated with the measures (Avg. Rtg., $r@stars, AVG) and (Min. Age, $c@age, MIN)]
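The two example measures, (Avg. Rtg., $r@stars, AVG) and (Min. Age, $c@age, MIN), amount to grouping facts by a dimension level and applying an aggregation function. A minimal sketch, assuming the facts have already been flattened into records; the record layout and all values (in particular the ages) are made up for illustration.

```python
from collections import defaultdict
from statistics import mean

# Hypothetical flattened facts: one record per matched review.
facts = [
    {"category": "Phones", "stars": 4, "nationality": "DE", "age": 34},
    {"category": "Phones", "stars": 5, "nationality": "US", "age": 27},
    {"category": "Tablets", "stars": 5, "nationality": "US", "age": 41},
]


def aggregate(facts, level, attribute, func):
    """Group facts by a dimension level and apply func to the attribute values."""
    groups = defaultdict(list)
    for fact in facts:
        groups[fact[level]].append(fact[attribute])
    return {key: func(values) for key, values in groups.items()}


avg_rating = aggregate(facts, "category", "stars", mean)   # Avg. Rtg. by category
min_age = aggregate(facts, "nationality", "age", min)      # Min. Age by nationality
```

In the approach described on the slide, each group would be represented by a vertex that exposes these computed values as attributes; the dictionaries here stand in for those representative vertices.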

Operations: Roll-up, Drill-down, Slice & Dice


Roll-up/Drill-down

Granularity of the Cube
- Represents the grouping: the current levels of interest
- Initially: the lowest level of each dimension

Roll-up
- Reduces the granularity
- For dimension d, move up one level from l_i to l_{i+1}

Drill-down
- Increases the granularity
- For dimension d, move down one level from l_i to l_{i-1}

⇒ Introduce representative vertex for each group
⇒ Expose computed values for measures as attributes
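The granularity bookkeeping behind roll-up and drill-down can be sketched as a mapping from each dimension to its current level index. The function names and the dictionary representation are assumptions; the level names reuse the dimension specification example.

```python
# Dimension -> ordered levels, lowest (finest) level first.
dimensions = {
    "Category": ["Product category", "Product group", "Product area"],
    "Nationality": ["Nationality"],
}

# Initially: the lowest level of each dimension.
initial = {name: 0 for name in dimensions}


def roll_up(dimensions, granularity, dimension):
    """Move dimension from level l_i to l_{i+1}; returns a new granularity."""
    i = granularity[dimension]
    if i + 1 >= len(dimensions[dimension]):
        raise ValueError(f"{dimension} is already at its highest level")
    return {**granularity, dimension: i + 1}


def drill_down(dimensions, granularity, dimension):
    """Move dimension from level l_i to l_{i-1}; returns a new granularity."""
    i = granularity[dimension]
    if i == 0:
        raise ValueError(f"{dimension} is already at its lowest level")
    return {**granularity, dimension: i - 1}


rolled = roll_up(dimensions, initial, "Category")
```

After each change of granularity, the measures would be recomputed per group at the new levels.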

Slice & Dice

Function filter transforms fact base of cube
- evaluates level-predicate pairs
- removes facts not matching the predicates

For a single predicate applied to one dimension ⇒ slice

Example
Slice product reviews by German customers from the cube c: filter(c, {(Nationality, λ = DE)}).
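A sketch of the filter function, assuming the fact base is a list of records keyed by level name. The name `filter_cube` (chosen to avoid shadowing Python's built-in `filter`) and the sample records are hypothetical.

```python
def filter_cube(facts, level_predicates):
    """Keep only the facts for which every (level, predicate) pair holds."""
    return [
        fact for fact in facts
        if all(predicate(fact[level]) for level, predicate in level_predicates)
    ]


# Hypothetical fact base of a cube c.
cube = [
    {"Nationality": "DE", "Product category": "Phones", "stars": 4},
    {"Nationality": "US", "Product category": "Phones", "stars": 5},
    {"Nationality": "DE", "Product category": "Tablets", "stars": 5},
]

# Slice: a single predicate on one dimension.
german_reviews = filter_cube(cube, [("Nationality", lambda value: value == "DE")])

# Dice: predicates on several dimensions at once.
german_phone_reviews = filter_cube(
    cube,
    [
        ("Nationality", lambda value: value == "DE"),
        ("Product category", lambda value: value == "Phones"),
    ],
)
```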

Challenges: Unbalanced Hierarchies & OLAP Anomalies


Unbalanced Hierarchies

Facts with different granularities

Example
Products in categories and groups

[Figure: category hierarchy. Vertex labels: Computers & Accessories, Cell Phones & Accessories, Tablets, Smartphones, Phones, Google Nexus 5, Samsung E1200. Edge labels: in, part of. Attribute values shown: red, black, 16 GB.]

Relative dimension specification:
Product category: $p-[@type='in']->
Product group: $p-[@type='in']->-[@type='part-of']->
Product area: $p-[@type='in']->-[@type='part-of']->(2)

⇒ Absolute instead of relative dimension specification required
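The problem with a relative specification shows up as soon as two products hang at different depths of the hierarchy. In the sketch below the edge set is an assumption reconstructed from the figure labels (Google Nexus 5 in Smartphones, Samsung E1200 directly in Phones); the helper names are illustrative.

```python
# Assumed edges: the two products enter the hierarchy at different depths.
edges = [
    ("Google Nexus 5", "in", "Smartphones"),
    ("Smartphones", "part-of", "Phones"),
    ("Phones", "part-of", "Cell Phones & Accessories"),
    ("Samsung E1200", "in", "Phones"),
]


def hop(edges, sources, edge_type):
    """Vertices reachable from any source via one edge of the given type."""
    return {t for (s, label, t) in edges if s in sources and label == edge_type}


def product_category(product):
    # $p-[@type='in']->
    return hop(edges, {product}, "in")


def product_group(product):
    # $p-[@type='in']->-[@type='part-of']->
    return hop(edges, product_category(product), "part-of")


# "Phones" ends up on the category level for one product
# and on the group level for the other.
nexus_group = product_group("Google Nexus 5")
e1200_category = product_category("Samsung E1200")
```

Counting hops from each fact therefore mixes elements of different levels, which is why the slide calls for an absolute specification.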

Unbalanced Hierarchies

Solution: Pre-process the graph

Data Cleansing
- Balance hierarchies
- Add missing root nodes

[Figure: the category hierarchy after cleansing, with the additional vertices "Consumer Electronics" and "Dumbphones"]

Tagging
- Add attributes for absolute referencing

[Figure: the category hierarchy with numeric tags (1, 2, 3) attached to the category vertices]
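Tagging can be sketched as computing a level number for every category once and then selecting ancestors by that number instead of by hop count. The tagging rule used here (depth below the root, with roots tagged 0) and the edge set are assumptions; the slide does not state how the tags in the figure are derived.

```python
# Assumed hierarchy: category -> parent category (part-of), product -> category (in).
part_of = {
    "Smartphones": "Phones",
    "Phones": "Cell Phones & Accessories",
    "Tablets": "Computers & Accessories",
}
product_in = {"Google Nexus 5": "Smartphones", "Samsung E1200": "Phones"}


def depth(category):
    """Number of part-of edges between the category and its root."""
    d = 0
    while category in part_of:
        category = part_of[category]
        d += 1
    return d


# The tag is stored once per category and serves as the absolute reference.
level_tag = {c: depth(c) for c in set(part_of) | set(part_of.values())}


def ancestor_at_level(product, level):
    """Walk up from the product's category until a vertex carries the wanted tag."""
    category = product_in[product]
    while category is not None and level_tag[category] != level:
        category = part_of.get(category)
    return category
```

With absolute references both products map to "Phones" at level 1. Samsung E1200 has no element at level 2 and yields `None`, which is the kind of gap the data cleansing step closes by adding a vertex such as "Dumbphones".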

OLAP Anomalies

It depends on the model:
- double counting can occur if a cardinality assumption is violated (1:1 vs. 1:n relationship)
- incompleteness can occur if a connectivity assumption is violated

[Figure: two example graphs with the categories Tablets and Phones and the products Apple iPad Air (128 GB) and Samsung E1200 (black), connected by "in" edges, illustrating the two anomalies]
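Both anomalies can be reproduced with a simple count of products per category. The data below is hypothetical: it assumes one product with two "in" edges (violating a 1:1 cardinality assumption) and one product without any "in" edge (violating the connectivity assumption).

```python
from collections import Counter

products = ["Apple iPad Air", "Samsung E1200", "Google Nexus 5"]

# Assumed "in" edges (product, category).
in_edges = [
    ("Apple iPad Air", "Tablets"),
    ("Apple iPad Air", "Phones"),   # second category: 1:n instead of 1:1
    ("Samsung E1200", "Phones"),
    # "Google Nexus 5" has no "in" edge at all
]

per_category = Counter(category for _, category in in_edges)

# Double counting: the per-category counts add up to more than the
# number of distinct products that were grouped.
grouped_products = {product for product, _ in in_edges}
double_counted = sum(per_category.values()) - len(grouped_products)

# Incompleteness: products that never reach any group.
missing = [p for p in products if p not in grouped_products]
```

A total obtained by summing the groups is then both too high (the iPad Air is counted twice) and incomplete (the Nexus 5 is not counted at all).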

Conclusion

Powerful Mapping of Multidimensional Analytics
- Expose well-known concepts and operations
- Emphasize challenges posed by graph data
⇒ Open up the graph world to Business Intelligence

Flexible Workflow for the Big Graph Data Era
- No up-front schema design
- Adapt to changing data and requirements
⇒ What is a fact today can be a dimension tomorrow

Additional Material & References

References

Chen Chen, Xifeng Yan, Feida Zhu, Jiawei Han, and Philip S. Yu. Graph OLAP: Towards Online Analytical Processing on Graphs. In Proceedings of the Eighth International Conference on Data Mining, pages 103–112, Pisa, Italy, December 2008. IEEE.

[Mic09] Microsoft Research. The Fourth Paradigm: Data-Intensive Scientific Discovery. Microsoft Press, 2009.

Marko A. Rodriguez and Peter Neubauer. Constructions from Dots and Lines. Bulletin of the American Society for Information Science and Technology, 36(6):35–41, 2010.

Yuanyuan Tian and Jignesh M. Patel. TALE: A Tool for Approximate Large Graph Matching. In 2008 IEEE 24th International Conference on Data Engineering, pages 963–972. IEEE, April 2008.

Peixiang Zhao, Xiaolei Li, Dong Xin, and Jiawei Han. Graph Cube: On Warehousing and OLAP Multidimensional Networks. In Proceedings of the ACM SIGMOD International Conference on Management of Data, pages 853–864, Athens, Greece, 2011. ACM.

Ning Zhang, Yuanyuan Tian, and Jignesh M. Patel. Discovery-Driven Graph Summarization. In Proceedings of the 26th International Conference on Data Engineering, pages 880–891, Long Beach, CA, USA, 2010. IEEE.