Peixiang Zhao, Xiaolei Li, Dong Xin, and Jiawei Han. Graph Cube: On Warehousing and OLAP Multidimensional Networks. In Proceedings of the ACM SIGMOD ...
SynopSys: Foundations for Multidimensional Graph Analytics
Business Intelligence for the Real-Time Enterprise (BIRTE 2014)
Michael Rudolf (1), Hannes Voigt (1), Christof Bornhoevd (2), and Wolfgang Lehner (1)
(1) Database Technology Group, Technische Universität Dresden
(2) SAP Labs, LLC, Palo Alto
September 1, 2014
Motivation: Big (Graph) Data
• 645 M users, 135 K new every day
• 58 M tweets & 2.1 G searches / day
Peak Performance
• Nov. 26, 2012: 26.5 M items (306/sec)
• Nov. 23, 2013: 36.8 M items (426/sec)
Intensional vs. Extensional
Intensional
• Schema & integrity constraints
• Created at design time by domain experts
Extensional
• Collect lots of data first
• Try to deduce the intension
• The Fourth Paradigm [Mic09]
[Figure: timeline (axis: Time) with the labels "ETL" and "Once & forever"]
© Michael Rudolf | SynopSys: Foundations for Multidimensional Graph Analytics
The Property Graph Model
[Figure: example property graph. Vertices: categories (Consumer Electronics, Tablets, Phones), products (Apple iPhone 4, Apple iPhone 5, Apple iPad MC707LL/A) with color and storage attributes, customers (Freddy FR, Steve US, Karl DE, Mike US), plus review and order vertices. Edge labels: part of, in, authors, rates, likes, records, contains; edge and vertex attributes include star ratings and ordered/delivered dates (24/02/14).]
• Provides directed, attributed multi-relational graphs
• Attributes on vertices and edges as key-value pairs (instance-level instead of class-level)
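A minimal sketch of how such a property graph could be held in memory. The class names, the numeric vertex ids, and the concrete edges are illustrative assumptions, not the data structures used by the authors; only the idea of instance-level key-value attributes on vertices and edges comes from the slide.

```python
from dataclasses import dataclass, field
from typing import Any, Dict, List


@dataclass
class Vertex:
    vid: int
    attrs: Dict[str, Any] = field(default_factory=dict)  # instance-level key-value pairs


@dataclass
class Edge:
    src: int  # directed: src -> dst
    dst: int
    attrs: Dict[str, Any] = field(default_factory=dict)


@dataclass
class PropertyGraph:
    vertices: Dict[int, Vertex] = field(default_factory=dict)
    edges: List[Edge] = field(default_factory=list)  # a list permits parallel edges

    def add_vertex(self, vid: int, **attrs: Any) -> None:
        self.vertices[vid] = Vertex(vid, dict(attrs))

    def add_edge(self, src: int, dst: int, **attrs: Any) -> None:
        self.edges.append(Edge(src, dst, dict(attrs)))

    def out_edges(self, vid: int, edge_type: str) -> List[Edge]:
        """Outgoing edges of `vid` whose 'type' attribute equals `edge_type`."""
        return [e for e in self.edges
                if e.src == vid and e.attrs.get("type") == edge_type]


# Illustrative data loosely modeled on the example figure (ids and edges are made up).
g = PropertyGraph()
g.add_vertex(1, type="customer", name="Steve", nationality="US")
g.add_vertex(2, type="product", name="Apple iPhone 5")
g.add_vertex(3, type="category", name="Phones")
g.add_edge(1, 2, type="likes")
g.add_edge(2, 3, type="in")
```

Because attributes live on each instance, two vertices of the same "type" may carry different attribute sets, which is what distinguishes this model from a class-level schema.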
Agenda
Analytical Scenario: From Graphs to Cubes
Operations: Roll-up, Drill-down, Slice & Dice
Challenges: Unbalanced Hierarchies & OLAP Anomalies
Graph Cube
[Figure: the example property graph from "The Property Graph Model", annotated step by step]
1. Identify facts
2. Specify dimensions
3. Define measures
Facts
Depending on the use case, a (base) fact can be
• a vertex attribute, an edge attribute, or
• the presence of an edge.
In general: a subgraph
→ Use pattern matching
→ Graphical specification instead of DSL
Example
[Figure: fact pattern with the edges "authors" and "rates"]
Match reviews of products and their authors (vertex types indicated via color)
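The example pattern can be sketched as a naive subgraph match. The edge directions (customer authors review, review rates product), the vertex ids, and the `type` attribute standing in for the vertex colors are assumptions made for illustration; the slides specify the pattern graphically, not in code.

```python
# Vertices with instance-level attributes; ids are hypothetical.
vertices = {
    "c1": {"type": "customer", "name": "Steve"},
    "c2": {"type": "customer", "name": "Karl"},
    "r1": {"type": "review", "stars": 5},
    "r2": {"type": "review", "stars": 4},
    "p1": {"type": "product", "name": "Apple iPhone 5"},
}

# Directed edges as (source, label, target).
edges = [
    ("c1", "authors", "r1"),
    ("r1", "rates", "p1"),
    ("c2", "authors", "r2"),
    ("r2", "rates", "p1"),
    ("c1", "likes", "p1"),  # not part of the pattern, must be ignored
]


def match_review_facts(vertices, edges):
    """Return one binding {c, r, p} per subgraph customer-authors->review-rates->product."""
    facts = []
    for c, label1, r in edges:
        if label1 != "authors":
            continue
        if vertices[c]["type"] != "customer" or vertices[r]["type"] != "review":
            continue
        for r2, label2, p in edges:
            if r2 == r and label2 == "rates" and vertices[p]["type"] == "product":
                facts.append({"c": c, "r": r, "p": p})
    return facts


facts = match_review_facts(vertices, edges)
```

Each binding is one base fact, i.e. a matched subgraph rather than a single attribute value.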
Dimensions
Dimensions can be
1. vertex or edge attributes
2. connectivity
[Figure: the example property graph with the dimension-relevant attributes and edges highlighted]
Structure in Dimensions
• extrinsic: not contained in graph data, needs to be provided externally (e.g., GeoNames)
• intrinsic: embodied in graph data
  - explicit: captured as topological information
  - implicit: has to be derived from attribute values
Intrinsic Dimensions
Explicit Dimensions
• Can be specified using path expressions
• In general requires one path expression per level, e.g.
  -[@type='belongsTo']->[@type='state']-[@type='partOf']->[@type='country']
Implicit Dimensions
• Might require bucketization
• In general requires one expression per level, e.g. GetWeekOfYear(@ordered) and GetYear(@ordered)
[Figure: syntax diagram of path expressions with the elements: alias; "@" attribute (access of vertex or edge attribute); "-[" edge predicate "]->" "[" vertex predicate "]" "(" length ")" (paths with optional recursion depth, optionally satisfying the predicates)]
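The two level expressions for an implicit time dimension can be sketched as follows. The function names mirror GetWeekOfYear and GetYear from the slide, but their semantics here (ISO 8601 week numbering) are an assumption; the date format day/month/two-digit year is taken from the "ordered 24/02/14" label in the example graph.

```python
from datetime import date, datetime


def parse_ordered(value: str) -> date:
    """Parse an @ordered attribute such as '24/02/14' (day/month/two-digit year)."""
    return datetime.strptime(value, "%d/%m/%y").date()


def get_week_of_year(d: date) -> int:
    """Lower level: ISO 8601 week number (assumed semantics of GetWeekOfYear)."""
    return d.isocalendar()[1]


def get_year(d: date) -> int:
    """Upper level: the ISO year, so that every week belongs to exactly one year."""
    return d.isocalendar()[0]


ordered = parse_ordered("24/02/14")
levels = (get_week_of_year(ordered), get_year(ordered))
```

Using the ISO year rather than the calendar year is a deliberate choice in this sketch: a week that straddles New Year would otherwise map to two different years, which conflicts with the hierarchy property required of dimension levels.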
Dimension Specification Example

Name        | Seed Pattern | Levels
Nationality | $c           | $c@nationality
Category    | $p           | Product category: $p-[@type='in']->
            |              | Product group: $p-[@type='in']->-[@type='part-of']->
            |              | Product area: $p-[@type='in']->-[@type='part-of']->(2)

Seed Pattern
• Connects facts to dimensions
• Is matched against facts
→ Has to be a super pattern of the fact pattern (i.e., more general)
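A rough sketch of how the three Category levels could be evaluated for a product bound to $p. It reads "(2)" as "repeat the last step twice", which is an interpretation of the slide, and the small category graph is hypothetical data, not taken from the talk.

```python
# Directed edges as (source, type, target); hypothetical example data.
edges = [
    ("Google Nexus 5", "in", "Smartphones"),
    ("Smartphones", "part-of", "Phones"),
    ("Phones", "part-of", "Cell Phones & Accessories"),
]

# Each level is the sequence of edge types to follow, starting at $p.
LEVELS = {
    "Product category": ["in"],
    "Product group": ["in", "part-of"],
    "Product area": ["in", "part-of", "part-of"],  # assumed meaning of ->(2)
}


def follow(start, edge_types):
    """Follow one outgoing edge of the given type per step; return the reached vertices."""
    frontier = {start}
    for edge_type in edge_types:
        frontier = {dst for (src, et, dst) in edges
                    if src in frontier and et == edge_type}
    return frontier


def dimension_values(product):
    """Evaluate every level expression for one seed binding of $p."""
    return {level: follow(product, path) for level, path in LEVELS.items()}


values = dimension_values("Google Nexus 5")
```

Every level expression starts from the seed binding, so the levels are defined relative to the product rather than relative to each other.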
Properties of Dimensions
Monotony
Levels should be ordered such that the number of items decreases.

Level | Name      | # Elements
1     | Region    | 125
2     | Country   | 30
3     | Continent | 3

Hierarchy
Levels should form hierarchies. If two facts map to the same element in l_i, they should map to the same element in l_{i+1} as well.
→ Functional dependency

Fact | Level 1 | Level 2 | Level 3
A    | Saxony  | Germany | Europe
B    | Saxony  | Germany | Europe
C    | Bavaria | Germany | Europe
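Both properties can be checked mechanically on a fact-to-level mapping. The sketch below uses the three facts A, B, and C from the table; the function names are made up, and monotony is tested as "non-increasing", a slightly weaker reading than "decreases" that tolerates equally sized levels in tiny examples.

```python
# Fact -> (level 1, level 2, level 3), as in the table above.
facts = {
    "A": ("Saxony", "Germany", "Europe"),
    "B": ("Saxony", "Germany", "Europe"),
    "C": ("Bavaria", "Germany", "Europe"),
}


def level_sizes(facts):
    """Number of distinct elements per level."""
    depth = len(next(iter(facts.values())))
    return [len({levels[i] for levels in facts.values()}) for i in range(depth)]


def is_monotone(facts):
    """Monotony: the number of elements does not grow from l_i to l_{i+1}."""
    sizes = level_sizes(facts)
    return all(a >= b for a, b in zip(sizes, sizes[1:]))


def is_hierarchy(facts):
    """Hierarchy: each element of l_i determines exactly one element of l_{i+1}."""
    depth = len(next(iter(facts.values())))
    for i in range(depth - 1):
        parent = {}
        for levels in facts.values():
            # setdefault records the first parent seen; a different one breaks the FD.
            if parent.setdefault(levels[i], levels[i + 1]) != levels[i + 1]:
                return False
    return True


# A fact that maps Saxony to a second country violates the functional dependency.
broken = dict(facts, D=("Saxony", "France", "Europe"))
```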
Measures
A measure is a derived fact
• combining several facts
• computed by a specified function (e.g., scalar, aggregation).
→ Annotate the fact pattern
→ Introduce representative vertex
Example
• Average product rating by product category
• Minimum age of customers by nationality
[Figure: annotated fact pattern over the variables $c, $a, $r, $e, $p with the edges "authors" and "rates"; measure annotations: (Avg. Rtg., $r@stars, AVG) and (Min. Age, $c@age, MIN)]
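The two example measures amount to a group-by followed by an aggregation function. The sketch below assumes the matched facts have already been flattened into records; the record layout, the ages, and the helper name `aggregate` are hypothetical.

```python
from collections import defaultdict
from statistics import mean

# One record per matched fact; attribute values are made up for illustration.
facts = [
    {"category": "Phones", "nationality": "US", "stars": 5, "age": 31},
    {"category": "Phones", "nationality": "DE", "stars": 4, "age": 42},
    {"category": "Tablets", "nationality": "US", "stars": 5, "age": 27},
]


def aggregate(facts, group_key, value_key, func):
    """Group facts by one dimension level and apply an aggregation function."""
    groups = defaultdict(list)
    for fact in facts:
        groups[fact[group_key]].append(fact[value_key])
    return {key: func(values) for key, values in groups.items()}


# (Avg. Rtg., $r@stars, AVG) grouped by product category
avg_rating = aggregate(facts, "category", "stars", mean)
# (Min. Age, $c@age, MIN) grouped by nationality
min_age = aggregate(facts, "nationality", "age", min)
```

In the graph setting, each resulting key would correspond to a representative vertex that carries the computed value as an attribute.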
Operations: Roll-up, Drill-down, Slice & Dice
Roll-up/Drill-down
Granularity of the Cube
• Represents the grouping: the current levels of interest
• Initially: the lowest level of each dimension
Roll-up
• Reduces the granularity
• For dimension d, move up one level from l_i to l_{i+1}
Drill-down
• Increases the granularity
• For dimension d, move down one level from l_i to l_{i-1}
→ Introduce representative vertex for each group
→ Expose computed values for measures as attributes
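Treating the granularity as one level index per dimension, roll-up and drill-down become index shifts followed by regrouping. This is a sketch under that assumption; the dimension name "Location", the COUNT measure, and all function names are illustrative.

```python
from collections import Counter

# Levels per dimension, ordered from lowest (index 0) to highest.
LEVELS = {"Location": ["Region", "Country", "Continent"]}

# Each fact stores its element at every level of every dimension.
facts = [
    {"Location": ("Saxony", "Germany", "Europe")},
    {"Location": ("Saxony", "Germany", "Europe")},
    {"Location": ("Bavaria", "Germany", "Europe")},
]


def roll_up(granularity, dim):
    """Move dimension `dim` from l_i to l_{i+1}."""
    if granularity[dim] + 1 >= len(LEVELS[dim]):
        raise ValueError(f"{dim} is already at its highest level")
    return {**granularity, dim: granularity[dim] + 1}


def drill_down(granularity, dim):
    """Move dimension `dim` from l_i to l_{i-1}."""
    if granularity[dim] == 0:
        raise ValueError(f"{dim} is already at its lowest level")
    return {**granularity, dim: granularity[dim] - 1}


def group(facts, granularity):
    """One entry per group (the representative), exposing a COUNT measure."""
    dims = sorted(granularity)
    return dict(Counter(tuple(f[d][granularity[d]] for d in dims) for f in facts))


base = {"Location": 0}  # initially: the lowest level of each dimension
rolled = roll_up(base, "Location")
```

Both operations return a new granularity instead of mutating the old one, so drilling back down restores the previous grouping exactly.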
Slice & Dice
Function filter transforms the fact base of the cube
• evaluates level-predicate pairs
• removes facts not matching the predicates
For a single predicate applied to one dimension → slice
Example
Slice product reviews by German customers from the cube c: filter(c, {(Nationality, λ = DE)}).
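A small sketch of the filter function over a flattened fact base. The name `filter_cube` (chosen to avoid shadowing Python's built-in `filter`), the record layout, and the reading of "dice" as several level-predicate pairs at once are assumptions.

```python
# Fact base with one value per dimension level of interest; data is illustrative.
facts = [
    {"Nationality": "DE", "Category": "Phones", "stars": 4},
    {"Nationality": "US", "Category": "Phones", "stars": 5},
    {"Nationality": "DE", "Category": "Tablets", "stars": 5},
]


def filter_cube(facts, predicates):
    """Keep only the facts that satisfy every (level, predicate) pair."""
    return [fact for fact in facts
            if all(pred(fact[level]) for level, pred in predicates)]


# Slice: a single predicate on one dimension, as in filter(c, {(Nationality, λ = DE)}).
sliced = filter_cube(facts, [("Nationality", lambda v: v == "DE")])

# Dice (assumed): predicates on several dimensions at once.
diced = filter_cube(facts, [
    ("Nationality", lambda v: v == "DE"),
    ("Category", lambda v: v == "Tablets"),
])
```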
Challenges: Unbalanced Hierarchies & OLAP Anomalies
Unbalanced Hierarchies
Facts with different granularities
Example
Products in categories and groups
[Figure: category graph with the vertices Computers & Accessories, Cell Phones & Accessories, Tablets, Phones, Smartphones, Google Nexus 5, and Samsung E1200, connected by "part of" and "in" edges; product attributes shown include 16 GB, black, and red]
Relative dimension specification:
Product category: $p-[@type='in']->
Product group: $p-[@type='in']->-[@type='part-of']->
Product area: $p-[@type='in']->-[@type='part-of']->(2)
→ Absolute instead of relative dimension specification required
Unbalanced Hierarchies
Solution: Pre-process the graph
Data Cleansing
• Balance hierarchies
• Add missing root nodes
[Figure: the category graph after cleansing, with the added vertices "Consumer Electronics" and "Dumbphones"]
Tagging
• Add attributes for absolute referencing
[Figure: the category graph with numeric attributes (1, 2, 3) added to the category vertices]
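The difference between relative and absolute level references can be sketched on a tiny unbalanced hierarchy. The edges and the numeric level tags below are guesses at what the figures show, so the data should be read as hypothetical; only the idea of tagging vertices for absolute referencing comes from the slide.

```python
# Hypothetical unbalanced hierarchy: one product sits one level higher than the other.
in_edge = {"Google Nexus 5": "Smartphones", "Samsung E1200": "Phones"}
part_of = {"Smartphones": "Phones", "Phones": "Cell Phones & Accessories"}

# Tagging: an absolute level attribute per category vertex (numbering is assumed).
level_tag = {"Smartphones": 1, "Phones": 2, "Cell Phones & Accessories": 3}


def relative_level(product, steps):
    """Follow 'in', then `steps` 'part-of' edges, counted from the product."""
    vertex = in_edge[product]
    for _ in range(steps):
        vertex = part_of.get(vertex)
        if vertex is None:
            return None
    return vertex


def absolute_level(product, tag):
    """Walk upwards until a category carrying the requested level tag is found."""
    vertex = in_edge[product]
    while vertex is not None and level_tag[vertex] != tag:
        vertex = part_of.get(vertex)
    return vertex


# Relative: the same expression yields categories of different depth.
rel = (relative_level("Google Nexus 5", 0), relative_level("Samsung E1200", 0))
# Absolute: both products resolve to the same level-2 category.
ab = (absolute_level("Google Nexus 5", 2), absolute_level("Samsung E1200", 2))
```

With tags alone, `absolute_level("Samsung E1200", 1)` still finds nothing, which is the gap the cleansing step closes by inserting a filler category at the missing level.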
OLAP Anomalies
It depends on the model:
• double counting can occur if a cardinality assumption is violated (1:1 vs. 1:n relationship)
• incompleteness can occur if a connectivity assumption is violated
[Figure: two examples over the categories Tablets and Phones and the products Apple iPad Air (128 GB) and Samsung E1200 (black), connected by "in" edges]
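Both anomalies show up when per-category counts are compared with the number of products. The two scenarios below are hypothetical readings of the figure (a product placed in two categories, and a product without any category); the function name is made up.

```python
from collections import Counter


def count_by_category(in_edges):
    """COUNT of products per category, following every 'in' edge."""
    counts = Counter()
    for categories in in_edges.values():
        for category in categories:
            counts[category] += 1
    return counts


# Cardinality assumption violated: the product is "in" two categories (1:n).
double_counting = {
    "Apple iPad Air": ["Tablets", "Phones"],
    "Samsung E1200": ["Phones"],
}

# Connectivity assumption violated: one product has no "in" edge at all.
incomplete = {
    "Apple iPad Air": [],
    "Samsung E1200": ["Phones"],
}

total_double = sum(count_by_category(double_counting).values())
total_incomplete = sum(count_by_category(incomplete).values())
```

Summing the per-category counts overstates the two products in the first scenario and understates them in the second, so neither total can be trusted without checking the assumptions first.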
Conclusion
Powerful Mapping of Multidimensional Analytics
• Expose well-known concepts and operations
• Emphasize challenges posed by graph data
→ Open up the graph world to Business Intelligence
Flexible Workflow for the Big Graph Data Era
• No up-front schema design
• Adapt to changing data and requirements
→ What is a fact today can be a dimension tomorrow
Additional Material & References
References
Chen Chen, Xifeng Yan, Feida Zhu, Jiawei Han, and Philip S. Yu. Graph OLAP: Towards Online Analytical Processing on Graphs. In Proceedings of the Eighth International Conference on Data Mining, pages 103-112, Pisa, Italy, December 2008. IEEE.
Microsoft Research. The Fourth Paradigm: Data-Intensive Scientific Discovery. Microsoft Press, 2009.
Marko A. Rodriguez and Peter Neubauer. Constructions from Dots and Lines. Bulletin of the American Society for Information Science and Technology, 36(6):35-41, 2010.
Yuanyuan Tian and Jignesh M. Patel. TALE: A Tool for Approximate Large Graph Matching. In 2008 IEEE 24th International Conference on Data Engineering, pages 963-972. IEEE, April 2008.
Peixiang Zhao, Xiaolei Li, Dong Xin, and Jiawei Han. Graph Cube: On Warehousing and OLAP Multidimensional Networks. In Proceedings of the ACM SIGMOD International Conference on Management of Data, pages 853-864, Athens, Greece, 2011. ACM.
Ning Zhang, Yuanyuan Tian, and Jignesh M. Patel. Discovery-Driven Graph Summarization. In Proceedings of the 26th International Conference on Data Engineering, pages 880-891, Long Beach, CA, USA, 2010. IEEE.