GraphREL: A Decomposition-Based and Selectivity-Aware Relational Framework for Processing Sub-graph Queries
Sherif Sakr
School of Computer Science and Engineering, University of New South Wales
The 14th International Conference on Database Systems for Advanced Applications (DASFAA’09)
21 April 2009
S. Sakr (CSE, UNSW)
DASFAA’09
21 April 2009
1 / 25
Motivations
Graphs are among the most complex and general forms of data structures. Recently, they have been widely used to model complex structured and schemaless data such as XML documents, social networks, chemical compounds and business process models. Retrieving the graphs that contain a query graph from a large graph database is a key performance issue in all of these graph-based applications. The success of any graph database application depends directly on the efficiency of its graph indexing and query processing mechanisms. RDBMSs have repeatedly shown that they are very efficient, scalable and successful in hosting different kinds of data.
Preliminaries: Graph Data Model
In labelled graphs, vertices and edges represent entities and the relationships between them, respectively. The attributes associated with these entities and relationships are called labels.
A graph database D is a collection of member graphs D = {g1, g2, ..., gn}, where each member graph gi is denoted as (V, E, Lv, Le):
V is the set of vertices.
E ⊆ V × V is the set of edges joining two distinct vertices.
Lv is the set of vertex labels.
Le is the set of edge labels.
Labelled graphs are classified according to the direction of their edges into two main classes:
1. Directed labelled graphs such as XML, RDF and traffic networks.
2. Undirected labelled graphs such as social networks and chemical compounds.
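The data model above can be sketched in Python as follows. This is only an illustration and not code from GraphREL: the class name `LabelledGraph`, its field names and the sample vertices and edges are all hypothetical.

```python
from dataclasses import dataclass, field
from typing import Dict, Tuple


@dataclass
class LabelledGraph:
    """One member graph g = (V, E, Lv, Le); hypothetical helper class."""
    graph_id: int
    directed: bool = True
    # V with labels: vertexID -> vertex label
    vertices: Dict[int, str] = field(default_factory=dict)
    # E with labels: (source, destination) -> edge label
    edges: Dict[Tuple[int, int], str] = field(default_factory=dict)

    def add_vertex(self, vertex_id: int, label: str) -> None:
        self.vertices[vertex_id] = label

    def add_edge(self, src: int, dst: int, label: str) -> None:
        if src == dst:
            raise ValueError("an edge joins two distinct vertices")
        if src not in self.vertices or dst not in self.vertices:
            raise KeyError("both endpoints must already be vertices")
        # Undirected graphs store each edge once, under a sorted key.
        key = (src, dst) if self.directed else (min(src, dst), max(src, dst))
        self.edges[key] = label

    @property
    def vertex_labels(self) -> set:
        """Lv: the set of vertex labels."""
        return set(self.vertices.values())

    @property
    def edge_labels(self) -> set:
        """Le: the set of edge labels."""
        return set(self.edges.values())


# Illustrative data only.
g = LabelledGraph(graph_id=1)
for vertex_id, label in [(1, "A"), (2, "C"), (3, "D")]:
    g.add_vertex(vertex_id, label)
g.add_edge(1, 2, "e")
g.add_edge(2, 3, "m")

database = [g]  # D = {g1, ..., gn}
```

Keeping the labels in dictionaries keyed by vertex ID and by vertex pair mirrors the two-table encoding used later in the deck, but any adjacency representation would serve equally well.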
Preliminaries: Subgraph Search Queries
Given a graph database D = {g1, g2, ..., gn} and a graph query q, a subgraph search query returns the answer set A = {gi | q ⊆ gi, gi ∈ D}.
A graph q is a sub-graph of a graph database member gi if the vertices and edges of q form a subset of the vertices and edges of gi.
Formally, g1(V1, E1, Lv1, Le1) is a sub-graph of g2(V2, E2, Lv2, Le2) if and only if:
1. For every distinct vertex x ∈ V1 with a label vl ∈ Lv1, there is a distinct vertex y ∈ V2 with a label vl ∈ Lv2.
2. For every distinct edge ab ∈ E1 with a label el ∈ Le1, there is a distinct edge ab ∈ E2 with a label el ∈ Le2.
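A direct, non-relational reading of this definition is a backtracking search for an injective, label-preserving mapping of query vertices onto graph vertices. The sketch below is a brute-force baseline for intuition, not the GraphREL evaluation strategy. The function names, the dictionary-based graph representation (directed edges keyed by vertex pair) and the sample data are assumptions.

```python
def is_subgraph(q_vertices, q_edges, g_vertices, g_edges):
    """Return True if query q can be embedded in graph g.

    Vertices are {vertexID: label}; edges are {(source, destination): label}.
    Each query vertex must map to a distinct graph vertex with the same label,
    and each query edge must exist between the mapped vertices with the same label.
    """
    q_ids = list(q_vertices)

    def extend(mapping):
        if len(mapping) == len(q_ids):
            return True
        qv = q_ids[len(mapping)]          # next unmapped query vertex
        used = set(mapping.values())
        for gv, g_label in g_vertices.items():
            if gv in used or g_label != q_vertices[qv]:
                continue
            mapping[qv] = gv
            # Check every query edge whose two endpoints are already mapped.
            consistent = all(
                g_edges.get((mapping[s], mapping[d])) == label
                for (s, d), label in q_edges.items()
                if s in mapping and d in mapping
            )
            if consistent and extend(mapping):
                return True
            del mapping[qv]               # backtrack
        return False

    return extend({})


def answer_set(database, q_vertices, q_edges):
    """A = {gi | q is a sub-graph of gi}; database maps graphID -> (vertices, edges)."""
    return [gid for gid, (gv, ge) in database.items()
            if is_subgraph(q_vertices, q_edges, gv, ge)]


# Illustrative data only.
database = {
    1: ({1: "A", 2: "C", 3: "D"}, {(1, 2): "e", (2, 3): "m"}),
    2: ({1: "A", 2: "D"}, {(1, 2): "e"}),
}
q_vertices = {10: "A", 11: "C"}
q_edges = {(10, 11): "e"}
result = answer_set(database, q_vertices, q_edges)
```

Here only graph 1 contains an A vertex connected by an e edge to a C vertex, so it is the only member of the answer set.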
Preliminaries: Subgraph Search Queries
Figure: An example graph database and graph query. (a) Sample graph database with member graphs g1, g2 and g3. (b) Graph query q.
Our Approach: GraphREL
Relational encoding of graph data.
SQL translation of sub-graph search queries: filtering phase and optional verification phase.
Partitioned B-tree indexes.
Statistical summaries.
Decomposition-based and selectivity-aware SQL translation.
Relational Encoding of Graph Data
The starting point of our relational framework is to find an efficient and suitable encoding for each member graph gi in the graph database D. We use the Vertex-Edge mapping scheme for storing directed labelled graphs, with the following structure:
Vertices(graphID, vertexID, vertexLabel)
Edges(graphID, sVertex, dVertex, edgeLabel)
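The Vertex-Edge mapping can be tried out with Python's standard `sqlite3` module. This is a minimal sketch: the column types, the primary keys, the `store_graph` helper and the sample graph are assumptions, since the slide gives only the column names.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE Vertices (
    graphID     INTEGER,
    vertexID    INTEGER,
    vertexLabel TEXT,
    PRIMARY KEY (graphID, vertexID)          -- assumed key
);
CREATE TABLE Edges (
    graphID   INTEGER,
    sVertex   INTEGER,                       -- source vertex ID
    dVertex   INTEGER,                       -- destination vertex ID
    edgeLabel TEXT,
    PRIMARY KEY (graphID, sVertex, dVertex)  -- assumed key
);
""")


def store_graph(conn, graph_id, vertices, edges):
    """Store one member graph; vertices = {vertexID: label}, edges = {(s, d): label}."""
    conn.executemany(
        "INSERT INTO Vertices VALUES (?, ?, ?)",
        [(graph_id, vid, label) for vid, label in vertices.items()],
    )
    conn.executemany(
        "INSERT INTO Edges VALUES (?, ?, ?, ?)",
        [(graph_id, s, d, label) for (s, d), label in edges.items()],
    )


# Illustrative data only.
store_graph(conn, 1, {1: "A", 2: "C", 3: "D"}, {(1, 2): "e", (2, 3): "m"})

vertex_count = conn.execute("SELECT COUNT(*) FROM Vertices").fetchone()[0]
edge_count = conn.execute("SELECT COUNT(*) FROM Edges").fetchone()[0]
```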
Relational Encoding of Graph Data
Figure: two sample member graphs, g1 and g2, with numbered vertices, and their encoding in the Vertices and Edges tables.

Vertices Table
graphID  vertexID  vLabel
1        1         A
1        2         A
1        3         D
1        4         A
1        5         C
1        6         B
2        1         A
2        2         C
2        3         D
2        4         C
2        5         B

Edges Table
graphID  sVertex  dVertex  eLabel
1        1        2        n
1        1        3        m
1        2        3        n
1        4        3        x
1        5        4        x
1        6        5        y
1        5        2        z
1        1        6        m
2        1        2        e
2        2        3        m
2        4        3        m
2        4        2        n
2        5        4        x
2        1        5        f
SQL Translation of Graph Queries
Filtering Phase: a sub-graph query q that consists of a set of vertices QV of size m and a set of edges QE of size n is evaluated using the following SQL translation template:
SELECT DISTINCT V1.graphID, Vi.vertexID
FROM Vertices AS V1, ..., Vertices AS Vm, Edges AS E1, ..., Edges AS En
WHERE ∀_{i=2}^{m} (V1.graphID = Vi.graphID)
AND ∀_{j=1}^{n} (V1.graphID = Ej.graphID)
AND ∀_{i=1}^{m} (Vi.vertexLabel = QVi.vertexLabel)
AND ∀_{j=1}^{n} (Ej.edgeLabel = QEj.edgeLabel)
AND ∀_{j=1}^{n} (Ej.sVertex = Vf.vertexID AND Ej.dVertex = Vt.vertexID);
Here Vf and Vt are the aliases of the source and destination vertices of the query edge QEj.
Verification Phase: an optional phase used to verify that the filtered vertices of each candidate graph are distinct. It is applied only if more than one vertex in the set of query vertices QV has the same label. The check is easily done by comparing the vertex IDs.
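The template can be instantiated mechanically from a query graph. The sketch below is one possible reading of it, written in Python against the standard `sqlite3` module rather than the engine GraphREL was evaluated on. The function names `translate` and `verify`, the list-based query representation and the sample rows are assumptions made for illustration.

```python
import sqlite3


def translate(q_vertices, q_edges):
    """Build the filtering-phase SQL for a query graph.

    q_vertices: list of vertex labels; vertex i (0-based) gets alias V{i+1}.
    q_edges: list of (source_index, destination_index, edge_label).
    Returns the SQL text and its bound parameters.
    """
    m, n = len(q_vertices), len(q_edges)
    tables = [f"Vertices AS V{i}" for i in range(1, m + 1)]
    tables += [f"Edges AS E{j}" for j in range(1, n + 1)]

    conds, params = [], []
    conds += [f"V1.graphID = V{i}.graphID" for i in range(2, m + 1)]
    conds += [f"V1.graphID = E{j}.graphID" for j in range(1, n + 1)]
    for i, label in enumerate(q_vertices, start=1):
        conds.append(f"V{i}.vertexLabel = ?")
        params.append(label)
    for j, (_, _, label) in enumerate(q_edges, start=1):
        conds.append(f"E{j}.edgeLabel = ?")
        params.append(label)
    for j, (src, dst, _) in enumerate(q_edges, start=1):
        conds.append(f"E{j}.sVertex = V{src + 1}.vertexID "
                     f"AND E{j}.dVertex = V{dst + 1}.vertexID")

    select = ", ".join(["V1.graphID"] + [f"V{i}.vertexID" for i in range(1, m + 1)])
    sql = (f"SELECT DISTINCT {select} FROM {', '.join(tables)} "
           f"WHERE {' AND '.join(conds)}")
    return sql, params


def verify(rows):
    """Optional verification: keep graphs whose matched vertex IDs are all distinct."""
    return sorted({row[0] for row in rows if len(set(row[1:])) == len(row) - 1})


# Illustrative data only.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE Vertices (graphID INTEGER, vertexID INTEGER, vertexLabel TEXT)")
conn.execute("CREATE TABLE Edges (graphID INTEGER, sVertex INTEGER, dVertex INTEGER, edgeLabel TEXT)")
conn.executemany("INSERT INTO Vertices VALUES (?, ?, ?)",
                 [(1, 1, "A"), (1, 2, "C"), (1, 3, "D"), (2, 1, "A"), (2, 2, "C")])
conn.executemany("INSERT INTO Edges VALUES (?, ?, ?, ?)",
                 [(1, 1, 2, "e"), (1, 2, 3, "m"), (2, 1, 2, "m")])

# Query: a vertex labelled A connected by an edge labelled e to a vertex labelled C.
sql, params = translate(["A", "C"], [(0, 1, "e")])
rows = conn.execute(sql, params).fetchall()
candidates = verify(rows)
```

Each query vertex and each query edge contributes one table alias, so a query with m vertices and n edges joins m + n table instances. That growth is what motivates the decomposition discussed later in the deck.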
Partitioned B-tree Indexes
Partitioned B-tree indexing is a slight variant of the B-tree indexing structure. The main idea is to use low-selectivity leading columns to maintain partitions within the associated B-tree. In labelled graphs, the number of distinct vertex and edge labels is generally far smaller than the number of vertices and edges. For example, an index defined on the columns (vertexLabel, graphID) can reduce the access cost of a sub-graph query with only one label to one disk page. In contrast, an index defined on the columns (graphID, vertexLabel) requires scanning a large number of disk pages. Partitioned B-tree indexes on the high-selectivity attributes achieve fixed execution times that no longer depend on the size of the whole graph database.
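The effect of column order can be observed with an ordinary composite index. The sketch below uses SQLite through Python's `sqlite3` module. SQLite does not implement partitioned B-trees, so this only illustrates why the label should be the leading column. The helper `plan_for` and the index name `idx_vertices` are hypothetical.

```python
import sqlite3


def plan_for(index_columns):
    """Return SQLite's query plan for a single-label lookup under the given index."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE Vertices (graphID INTEGER, vertexID INTEGER, vertexLabel TEXT)"
    )
    conn.execute(f"CREATE INDEX idx_vertices ON Vertices ({', '.join(index_columns)})")
    rows = conn.execute(
        "EXPLAIN QUERY PLAN SELECT graphID FROM Vertices WHERE vertexLabel = 'D'"
    ).fetchall()
    conn.close()
    # The last column of each plan row holds the human-readable detail text.
    return " | ".join(row[-1] for row in rows)


label_first = plan_for(["vertexLabel", "graphID"])
graph_first = plan_for(["graphID", "vertexLabel"])
print("(vertexLabel, graphID):", label_first)
print("(graphID, vertexLabel):", graph_first)
```

With the label as the leading column, the planner can seek directly to the entries for one label. With graphID leading, it typically has to walk the whole index or table, because the label is no longer a prefix of the key.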
Limitations of SQL-Based Translation Approach
An obvious problem of the SQL translation template is that it involves a large number of conjunctive SQL predicates and join operations between the encoding tables. Most relational query engines fail to execute the SQL translations of medium-sized or large sub-graph queries because they are too long and too complex (which does not necessarily mean they are too expensive). Therefore, we need a decomposition mechanism to divide a large and complex SQL translation query into a sequence of intermediate queries. Applying this decomposition mechanism blindly may lead to inefficient execution plans with very large, unnecessary and expensive intermediate results. We use statistical summary information to achieve an efficient decomposition process.
Statistical Summaries
In general, one of the most effective techniques for optimizing the execution times of SQL queries is to select the relational execution plan based on accurate selectivity information for the query predicates. We construct three Markov tables to store the frequency of occurrence of the distinct vertex labels, the distinct edge labels, and the connections between pairs of edges.

Markov Table summary of vertices labels
Vertex Label  Frequency
A             100
B             200
C             38
D             4
E             50
L             6
M             10
N             250
O             3
P             40
R             55

Markov Table summary of edges labels
Edge Label  Frequency
a           40
c           5
e           28
l           54
m           140
n           3
o           20
p           15
x           8
y           60
z           15

Markov Table summary of pair-wise edge connections
Edge Label Connection  Frequency
ab                     3
ac                     15
ae                     45
ec                     14
em                     103
la                     5
pc                     18
px                     45
xy                     25
xz                     2
za                     1
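The three summaries can be collected in one pass over the graph database. The sketch below assumes that a pair-wise edge connection such as "em" means an edge labelled e entering a vertex that an edge labelled m leaves. That interpretation, the function name `build_markov_tables` and the sample data are assumptions rather than details stated on the slide.

```python
from collections import Counter


def build_markov_tables(database):
    """Count label frequencies over all member graphs.

    database: {graphID: (vertices, edges)} with vertices = {vertexID: label}
    and edges = {(sVertex, dVertex): label}.
    """
    vertex_freq, edge_freq, connection_freq = Counter(), Counter(), Counter()
    for vertices, edges in database.values():
        vertex_freq.update(vertices.values())
        edge_freq.update(edges.values())
        # Two edges are connected when the first ends where the second starts.
        for (_, d1), first_label in edges.items():
            for (s2, _), second_label in edges.items():
                if d1 == s2:
                    connection_freq[(first_label, second_label)] += 1
    return vertex_freq, edge_freq, connection_freq


# Illustrative data only.
database = {
    1: ({1: "A", 2: "C", 3: "D"}, {(1, 2): "e", (2, 3): "m"}),
    2: ({1: "A", 2: "A"}, {(1, 2): "e"}),
}
vertex_freq, edge_freq, connection_freq = build_markov_tables(database)
```

Connections are keyed by a pair of labels rather than by a concatenated string such as "em", so that labels longer than one character stay unambiguous.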
Decomposition-Based and Selectivity-Aware SQL Translation
Identifying the pruning points.
Calculating the number of partitions.
Decomposed SQL translation:
  Blindly Single-Level Decomposition.
  Pruned Single-Level Decomposition.
  Pruned Multi-Level Decomposition.
Selectivity-aware annotations.
Decomposition-Based and Selectivity-Aware SQL Translation
Identifying the pruning points
Each vertex label, edge label or edge connection with a low frequency is considered a pruning point in our relational evaluation mechanism. Given a query graph q, we first check the structure of q against our summary Markov tables to identify the possible pruning points; NPP denotes their number.
Calculating the number of partitions
A sub-graph query q requires NJP join operations. Assuming that the relational query engine can evaluate up to MJP join operations in one query, the number of partitions (NOP) is computed as NJP/MJP.
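Both quantities are simple to compute once the Markov tables exist. In the sketch below, the frequency cut-off `threshold` and the decision to round NJP/MJP up to a whole number of partitions are assumptions, since the slide defines neither. The vertex frequencies are a subset of those in the Markov table shown earlier in the deck.

```python
import math


def pruning_points(query_items, frequency_table, threshold):
    """Return the query labels (or connections) whose frequency is below the cut-off.

    `threshold` is a hypothetical tuning parameter; items missing from the
    table are treated as frequency 0.
    """
    return [item for item in query_items
            if frequency_table.get(item, 0) < threshold]


def number_of_partitions(njp, mjp):
    """NOP = NJP / MJP, rounded up so that every join falls into some partition."""
    return math.ceil(njp / mjp)


vertex_freq = {"A": 100, "B": 200, "C": 38, "D": 4}
points = pruning_points(["A", "C", "D"], vertex_freq, threshold=10)
npp = len(points)
nop = number_of_partitions(njp=25, mjp=10)
```

In this example only the label D is rare enough to count as a pruning point, and 25 joins with at most 10 joins per query give three partitions.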
Decomposition-Based and Selectivity-Aware SQL Translation
Blindly Single-Level Decomposition
If NPP = 0, we blindly decompose the query q into NOP partitions. Each partition is translated into an intermediate evaluation step Si. The final evaluation step joins all intermediate evaluation steps and adds the conjunctive conditions of the partitions' connectors.
Pruned Single-Level Decomposition
If NPP >= NOP, we distribute the pruning points across the NOP intermediate partitions. This ensures a balanced and effective pruning of all intermediate results.
Pruned Multi-Level Decomposition
If NPP < NOP, we distribute the pruning points across a first level of intermediate results of the NOP partitions. An intermediate collective pruned step (IPS) is constructed by joining all the pruned first-level intermediate results. IPS is used as an entry pruning point for the remaining (NOP − NPP) non-pruned partitions in a hierarchical multi-level fashion. Each pruning point can be used to prune more than one partition (if possible).
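The choice between the three strategies depends only on NPP and NOP and can be written as a small decision function. The round-robin assignment of pruning points to partitions is an assumption: the slide only says the points are distributed across partitions. Both function names are hypothetical.

```python
def choose_decomposition(npp, nop):
    """Pick the decomposition strategy from the number of pruning points and partitions."""
    if npp == 0:
        return "blind single-level"
    if npp >= nop:
        return "pruned single-level"
    return "pruned multi-level"


def distribute(pruning_points, nop):
    """Assign pruning points to the NOP partitions in round-robin order (assumed policy)."""
    partitions = [[] for _ in range(nop)]
    for k, point in enumerate(pruning_points):
        partitions[k % nop].append(point)
    return partitions


strategy = choose_decomposition(npp=1, nop=3)
assignment = distribute(["D"], nop=3)
# Partitions that received no pruning point are the ones IPS would prune.
non_pruned = sum(1 for part in assignment if not part)
```

With one pruning point and three partitions, the multi-level case applies and NOP − NPP = 2 partitions are left without a pruning point of their own.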
Decomposition-Based and Selectivity-Aware SQL Translation
Figure: Selectivity-aware decomposition process. (a) NPP > NOP. (b) NPP < NOP. The diagram labels the intermediate steps S1 to S3, their SQL translations, and the FES SQL.
Decomposition-Based and Selectivity-Aware SQL Translation
Selectivity-aware Annotations
For any given SQL query, there are a large number of alternative execution plans. These alternative plans may differ significantly in their use of system resources or in their response time. We use the statistical summary information to give hints to the query optimizer by injecting additional selectivity information for the individual query predicates into the SQL translations of the graph queries:
SELECT fieldlist FROM tablelist WHERE Pi SELECTIVITY Si
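A selectivity value for a label predicate can be derived from the Markov tables as the label's share of all occurrences, and then appended to the predicate following the template above. The sketch below only builds the SQL text. The SELECTIVITY keyword is not standard SQL, so whether a given engine accepts it has to be checked against that engine's documentation. The function names and the frequencies are illustrative assumptions.

```python
def label_selectivity(label, frequency_table):
    """Estimated fraction of rows carrying `label` (frequency / total occurrences)."""
    total = sum(frequency_table.values())
    return frequency_table.get(label, 0) / total


def annotate(predicates):
    """Join (predicate, selectivity) pairs into one annotated WHERE clause body."""
    return " AND ".join(
        f"{predicate} SELECTIVITY {selectivity:.6f}"
        for predicate, selectivity in predicates
    )


# Illustrative frequencies only.
vertex_freq = {"A": 96, "D": 4}
selectivity_d = label_selectivity("D", vertex_freq)
where_clause = annotate([("V1.vertexLabel = 'D'", selectivity_d)])
query = f"SELECT V1.graphID FROM Vertices AS V1 WHERE {where_clause}"
```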
Experimental Results: Performance and Scalability
Figure: The scalability of GraphREL. (a) Synthetic Dataset, with series D2kV10E20L40M50, D10kV10E20L40M50, D50kV30E40L90M150 and D100kV30E40L90M150. (b) DBLP Dataset, with series 1MB, 10MB, 50MB and 100MB. Both panels plot Execution Time (ms) on a logarithmic axis against Query Size (Q4, Q8, Q12, Q16, Q20).
Experimental Results: The effect of using Partitioned B-tree Indexes and Selectivity Injections
Figure: The speedup improvement for the relational evaluation of sub-graph queries using partitioned B-tree indexes and selectivity-aware annotations. (a) Partitioned B-tree indexes. (b) Injection of selectivity annotations. Both panels show the Synthetic and DBLP datasets over Query Size (Q4, Q8, Q12, Q16, Q20); the recovered axis labels are Percentage of Improvement (%) and Execution Times (ms).
SQLBP: An Application of GraphREL
Many of today's information systems are driven by explicit process models. A business process is a set of coordinated activities that achieves a specific business objective. With the rapid growth in the number of process models, it becomes crucial for business process designers to be able to search their repository for models efficiently. SQLBP is a query processor for business process models. SQLBP is based on a new visual query language for business processes called BPMN-Q. The language addresses process definitions and extends the standard BPMN notation for modelling business processes for its concrete syntax. A BPMN-Q query is treated as a graph that is matched against process graph(s).
SQLBP: An Application of GraphREL
Figure (a): BPMN-Q Elements.
Figure (b): Example of a BPMN-Q query: (a) a process model with nodes A, B, C, D and E; (b) a query with a path element (//) connecting nodes B and D; (c) a sub-graph from the process in (a), with nodes B, C and D, matching the query in (b).
SQLBP: An Application of GraphREL
Figure: components of SQLBP as labelled in the diagram: BPMN-Q Query Editor, BPMN-Q query (GraphML), Model Editor, GraphML for display, SQL-Based Query Processor, Query Results, Updates, SQL Script, RDBMS Relational Business Process Repository, and Translation Middleware for BPEL, XLANG, ..., EPC.
Conclusions
GraphREL is a purely relational framework to store and query graph data. In principle, GraphREL has the following advantages:
It can reside on any relational database system and exploit its well-known, mature query optimization techniques as well as its efficient and scalable query processing techniques.
It requires no time for offline or pre-processing steps.
It can handle both static and dynamic (frequently updated) graph databases.
The selectivity annotations for the SQL evaluation scripts enable the relational query optimizer to select the most efficient execution plans and to prune the graph database members that are not required.
The End
Thank You