
GraphREL: A Decomposition-Based and Selectivity-Aware Relational Framework for Processing Sub-graph Queries

Sherif Sakr

School of Computer Science and Engineering, University of New South Wales

The 14th International Conference on Database Systems for Advanced Applications (DASFAA’09)

21 April 2009

S. Sakr (CSE, UNSW)

DASFAA’09

21 April 2009

1 / 25

Motivations

Graphs are among the most complex and general data structures. They are now widely used to model complex structured and schemaless data such as XML documents, social networks, chemical compounds and business process models.

Retrieving the graphs that contain a query graph from a large graph database is a key performance issue in all of these graph-based applications.

The success of any graph database application depends directly on the efficiency of its graph indexing and query processing mechanisms.

RDBMSs have repeatedly proven to be efficient, scalable and successful in hosting different kinds of data.


Preliminaries: Graph Data Model

In labelled graphs, vertices and edges represent entities and the relationships between them, respectively. The attributes associated with these entities and relationships are called labels.

A graph database D is a collection of member graphs D = {g1, g2, ..., gn}, where each member graph gi is denoted as (V, E, Lv, Le):
V is the set of vertices.
E ⊆ V × V is the set of edges, each joining two distinct vertices.
Lv is the set of vertex labels.
Le is the set of edge labels.

Labelled graphs are classified according to the direction of their edges into two main classes:
1. Directed labelled graphs, such as XML, RDF and traffic networks.
2. Undirected labelled graphs, such as social networks and chemical compounds.
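As an illustration of the data model above, the following is a minimal Python sketch of one member graph gi = (V, E, Lv, Le). The class name `LabelledGraph` and its methods are hypothetical and chosen only for this example; they are not part of GraphREL.

```python
from dataclasses import dataclass, field


@dataclass
class LabelledGraph:
    """One member graph g_i = (V, E, Lv, Le) of a graph database D."""
    graph_id: int
    directed: bool = True
    vertices: dict = field(default_factory=dict)  # vertex id -> vertex label
    edges: dict = field(default_factory=dict)     # (source id, target id) -> edge label

    def add_vertex(self, vid, label):
        self.vertices[vid] = label

    def add_edge(self, src, dst, label):
        if src == dst:
            raise ValueError("an edge joins two distinct vertices")
        if src not in self.vertices or dst not in self.vertices:
            raise KeyError("both endpoints must be existing vertices")
        # Undirected graphs store each edge under a canonical (sorted) key.
        key = (src, dst) if self.directed else tuple(sorted((src, dst)))
        self.edges[key] = label

    @property
    def vertex_labels(self):
        """Lv: the set of vertex labels."""
        return set(self.vertices.values())

    @property
    def edge_labels(self):
        """Le: the set of edge labels."""
        return set(self.edges.values())


# A graph database D is simply a collection of member graphs.
g = LabelledGraph(graph_id=1)
g.add_vertex(1, "A")
g.add_vertex(2, "C")
g.add_edge(1, 2, "e")
database = [g]
```

Storing undirected edges under a sorted key is one possible design choice; it keeps a single entry per edge regardless of the order in which the endpoints are given.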


Preliminaries: Subgraph Search Queries

Given a graph database D = {g1, g2, ..., gn} and a graph query q, a subgraph search query returns the answer set A = {gi | q ⊆ gi, gi ∈ D}.

A graph q is a sub-graph of a database member gi if the vertices and edges of q form a subset of the vertices and edges of gi. Formally, g1(V1, E1, Lv1, Le1) is a sub-graph of g2(V2, E2, Lv2, Le2) if and only if:

1. For every distinct vertex x ∈ V1 with a label vl ∈ Lv1, there is a distinct vertex y ∈ V2 with the same label vl ∈ Lv2.

2. For every distinct edge ab ∈ E1 with a label el ∈ Le1, there is a distinct edge ab ∈ E2 with the same label el ∈ Le2.
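The definition above can be checked directly, if inefficiently, by trying every assignment of query vertices to distinct graph vertices. The sketch below is a brute-force baseline for directed labelled graphs, not the GraphREL method; the function names and the dictionary-based graph representation are assumptions made for this example.

```python
from itertools import permutations


def is_subgraph(q_vertices, q_edges, g_vertices, g_edges):
    """Return True if query q can be embedded in graph g.

    Vertices are dicts {vertex id: label}; edges are dicts
    {(source id, target id): label} for directed labelled graphs.
    """
    q_ids = list(q_vertices)
    # Try every injective assignment of query vertices to graph vertices.
    for image in permutations(g_vertices, len(q_ids)):
        mapping = dict(zip(q_ids, image))
        # Condition 1: matched vertices carry the same label.
        if any(q_vertices[v] != g_vertices[mapping[v]] for v in q_ids):
            continue
        # Condition 2: every query edge exists in g with the same label.
        if all(g_edges.get((mapping[a], mapping[b])) == label
               for (a, b), label in q_edges.items()):
            return True
    return False


def subgraph_search(database, q_vertices, q_edges):
    """Answer set A = {g_i | q is a sub-graph of g_i, g_i in D}."""
    return [gid for gid, (gv, ge) in database.items()
            if is_subgraph(q_vertices, q_edges, gv, ge)]


# Hypothetical two-graph database and a one-edge query A -e-> C.
demo_db = {
    1: ({1: "A", 2: "C", 3: "D"}, {(1, 2): "e", (2, 3): "m"}),
    2: ({1: "A", 2: "C"}, {(2, 1): "e"}),
}
answer = subgraph_search(demo_db, {"x": "A", "y": "C"}, {("x", "y"): "e"})
```

The cost of this check grows factorially with the query size, which is the kind of cost that index-based and relational approaches aim to avoid.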


Preliminaries: Subgraph Search Queries

Figure: An example graph database and graph query. (a) Sample graph database with member graphs g1, g2 and g3; (b) Graph query q.


Our Approach: GraphREL

Relational encoding of graph data.
SQL translation of sub-graph search queries:
  Filtering phase.
  Optional verification phase.
Partitioned B-tree Indexes.
Statistical Summaries.
Decomposition-Based and Selectivity-Aware SQL Translation.


Relational Encoding of Graph Data

The starting point of our relational framework is to find an efficient and suitable encoding for each member graph gi in the graph database D. We use the Vertex-Edge mapping scheme for storing directed labelled graphs with the following structure:

Vertices(graphID, vertexID, vertexLabel)
Edges(graphID, sVertex, dVertex, edgeLabel)
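The Vertex-Edge mapping scheme can be sketched with Python's built-in `sqlite3` module as follows. The table and column names follow the schema above; the helper names, the column types and the sample graph are assumptions made for illustration only.

```python
import sqlite3


def create_schema(conn):
    """Create the two encoding tables of the Vertex-Edge mapping scheme."""
    conn.executescript("""
        CREATE TABLE Vertices (graphID INTEGER, vertexID INTEGER, vertexLabel TEXT);
        CREATE TABLE Edges (graphID INTEGER, sVertex INTEGER, dVertex INTEGER,
                            edgeLabel TEXT);
    """)


def store_graph(conn, graph_id, vertices, edges):
    """Encode one directed labelled graph.

    vertices: {vertex id: label}
    edges:    {(source id, destination id): label}
    """
    conn.executemany(
        "INSERT INTO Vertices VALUES (?, ?, ?)",
        [(graph_id, vid, label) for vid, label in vertices.items()])
    conn.executemany(
        "INSERT INTO Edges VALUES (?, ?, ?, ?)",
        [(graph_id, s, d, label) for (s, d), label in edges.items()])


# Hypothetical three-vertex graph: A -e-> C -m-> D.
conn = sqlite3.connect(":memory:")
create_schema(conn)
store_graph(conn, 1, {1: "A", 2: "C", 3: "D"}, {(1, 2): "e", (2, 3): "m"})
vertex_rows = conn.execute("SELECT * FROM Vertices ORDER BY vertexID").fetchall()
edge_rows = conn.execute("SELECT * FROM Edges ORDER BY sVertex").fetchall()
```

Because every row carries its graphID, all member graphs of the database share the same two tables.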


Relational Encoding of Graph Data

Figure: Two sample directed labelled graphs, g1 and g2, and their relational encoding.

Vertices Table

graphID  vertexID  vLabel
1        1         A
1        2         A
1        3         D
1        4         A
1        5         C
1        6         B
2        1         A
2        2         C
2        3         D
2        4         C
2        5         B

Edges Table

graphID  sVertex  dVertex  eLabel
1        1        2        n
1        1        3        m
1        2        3        n
1        4        3        x
1        5        4        x
1        6        5        y
1        5        2        z
1        1        6        m
2        1        2        e
2        2        3        m
2        4        3        m
2        4        2        n
2        5        4        x
2        1        5        f

SQL Translation of Graph Queries

Filtering Phase: a sub-graph query q consisting of a set of vertices QV of size m and a set of edges QE of size n is evaluated using the following SQL translation template:

SELECT DISTINCT V1.graphID, Vi.vertexID
FROM Vertices AS V1, ..., Vertices AS Vm, Edges AS E1, ..., Edges AS En
WHERE ∀_{i=2}^{m} (V1.graphID = Vi.graphID)
AND ∀_{j=1}^{n} (V1.graphID = Ej.graphID)
AND ∀_{i=1}^{m} (Vi.vertexLabel = QVi.vertexLabel)
AND ∀_{j=1}^{n} (Ej.edgeLabel = QEj.edgeLabel)
AND ∀_{j=1}^{n} (Ej.sVertex = Vf.vertexID AND Ej.dVertex = Vf.vertexID);

Here each ∀ stands for the conjunction of the bracketed predicate over the given index range, and Vf denotes the vertex alias mapped to the corresponding endpoint of query edge QEj.

Verification Phase: an optional phase that verifies that the filtered vertices of each candidate graph are distinct. It is applied only if more than one vertex in the set of query vertices QV has the same label, and it can be achieved easily by comparing vertex IDs.
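The template can be instantiated mechanically for a concrete query graph. The following is a minimal sketch, assuming the Vertices/Edges schema introduced earlier and SQLite as the host engine; the function `translate` and its `verify` flag are hypothetical. Folding the verification phase into the same statement as inequality predicates on vertex IDs is a simplification made here, not necessarily how GraphREL applies it.

```python
import sqlite3
from itertools import combinations


def translate(q_vertices, q_edges, verify=False):
    """Build the filtering-phase SQL for a query graph.

    q_vertices: {query vertex id: label}
    q_edges:    {(source id, target id): label}
    Returns (sql, params) with labels passed as bound parameters.
    """
    v_alias = {v: f"V{i}" for i, v in enumerate(q_vertices, start=1)}
    e_alias = {e: f"E{j}" for j, e in enumerate(q_edges, start=1)}
    tables = [f"Vertices AS {a}" for a in v_alias.values()]
    tables += [f"Edges AS {a}" for a in e_alias.values()]

    conds, params = [], []
    # All aliases must refer to the same member graph as V1.
    for a in list(v_alias.values())[1:] + list(e_alias.values()):
        conds.append(f"V1.graphID = {a}.graphID")
    # Label predicates for vertices, then edges.
    for v, a in v_alias.items():
        conds.append(f"{a}.vertexLabel = ?")
        params.append(q_vertices[v])
    for e, a in e_alias.items():
        conds.append(f"{a}.edgeLabel = ?")
        params.append(q_edges[e])
    # Connection predicates: tie each edge to its endpoint aliases.
    for (s, d), a in e_alias.items():
        conds.append(f"{a}.sVertex = {v_alias[s]}.vertexID")
        conds.append(f"{a}.dVertex = {v_alias[d]}.vertexID")
    if verify:
        # Verification: same-label query vertices must map to distinct IDs.
        for u, w in combinations(q_vertices, 2):
            if q_vertices[u] == q_vertices[w]:
                conds.append(f"{v_alias[u]}.vertexID <> {v_alias[w]}.vertexID")

    columns = ["V1.graphID"] + [f"{a}.vertexID" for a in v_alias.values()]
    sql = (f"SELECT DISTINCT {', '.join(columns)} FROM {', '.join(tables)} "
           f"WHERE {' AND '.join(conds)}")
    return sql, params


# Tiny illustrative database (hypothetical data).
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE Vertices (graphID, vertexID, vertexLabel)")
conn.execute("CREATE TABLE Edges (graphID, sVertex, dVertex, edgeLabel)")
conn.executemany("INSERT INTO Vertices VALUES (?, ?, ?)",
                 [(1, 1, "A"), (1, 2, "C"), (1, 3, "D"), (2, 1, "A"), (2, 2, "C")])
conn.executemany("INSERT INTO Edges VALUES (?, ?, ?, ?)",
                 [(1, 1, 2, "e"), (1, 2, 3, "m"), (2, 2, 1, "e")])

# Query: a vertex labelled A with an e-labelled edge to a vertex labelled C.
sql, params = translate({"x": "A", "y": "C"}, {("x", "y"): "e"})
rows = conn.execute(sql, params).fetchall()
```

A query with m vertices and n edges yields m + n table aliases in the FROM clause, which is why the statement grows quickly with the query size.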


Partitioned B-tree Indexes

Partitioned B-tree indexing is a slight variant of the B-tree indexing structure. The main idea is to use low-selectivity leading columns to maintain partitions within the associated B-tree.

In labelled graphs, the number of distinct vertex and edge labels is generally far smaller than the number of vertices and edges, respectively.

For example, an index defined on the columns (vertexLabel, graphID) can reduce the access cost of a sub-graph query with only one label to one disk page. In contrast, an index defined on the columns (graphID, vertexLabel) requires scanning a large number of disk pages.

Partitioned B-tree indexes on the high-selectivity attributes achieve fixed execution times that no longer depend on the size of the whole graph database.
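The effect of the column order can be observed with an ordinary composite index. The sketch below uses SQLite, which offers plain B-tree indexes rather than partitioned B-trees as such, so it only illustrates why a label-first index serves a single-label lookup; the index name and the synthetic data are assumptions.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE Vertices (graphID INTEGER, vertexID INTEGER, vertexLabel TEXT)")
# Synthetic data: 100 graphs with 10 vertices each, drawn from 4 labels.
conn.executemany(
    "INSERT INTO Vertices VALUES (?, ?, ?)",
    [(g, v, "ABCD"[v % 4]) for g in range(1, 101) for v in range(1, 11)])

# The label column leads the index, so all entries for one label are adjacent.
conn.execute("CREATE INDEX idx_label_graph ON Vertices (vertexLabel, graphID)")

plan = conn.execute(
    "EXPLAIN QUERY PLAN "
    "SELECT DISTINCT graphID FROM Vertices WHERE vertexLabel = ?", ("D",)
).fetchall()
# The last column of each plan row is the human-readable plan step.
plan_text = " ".join(row[-1] for row in plan)
print(plan_text)
```

With the label as the leading column, the engine can seek to the first entry for that label and read the matching graph IDs from adjacent index entries, instead of visiting entries for every graph.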


Limitations of SQL-Based Translation Approach

An obvious problem of the SQL translation template is that it involves a large number of conjunctive SQL predicates and join operations between the encoding tables.

Most relational query engines will fail to execute the SQL translations of medium-size or large sub-graph queries because they are too long and too complex (which does not necessarily mean they are too expensive).

Therefore, we need a decomposition mechanism that divides a large and complex SQL translation query into a sequence of intermediate queries.

Applying this decomposition mechanism blindly may lead to inefficient execution plans with very large, unnecessary and expensive intermediate results.

We use statistical summary information to achieve an efficient decomposition process.


Statistical Summaries

In general, one of the most effective techniques for optimizing the execution time of SQL queries is to select the relational execution plan based on accurate selectivity information about the query predicates.

We construct three Markov tables that store the frequency of occurrence of the distinct vertex labels, the distinct edge labels and the connections between pairs of vertices (edges).

Markov Table summary of vertices labels

Vertex Label  Frequency
A             100
B             200
C             38
D             4
E             50
L             6
M             10
N             250
O             3
P             40
R             55

Markov Table summary of edges labels

Edge Label  Frequency
a           40
c           5
e           28
l           54
m           140
n           3
o           20
p           15
x           8
y           60
z           15

Markov Table summary of pair-wise edge connections

Edge Label Connection  Frequency
ab                     3
ac                     15
ae                     45
ec                     14
em                     103
la                     5
pc                     18
px                     45
xy                     25
xz                     2
za                     1
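Such frequency tables can be gathered in one pass over the encoded graphs. The sketch below is one possible construction, not the authors' implementation: the function name is hypothetical, and it assumes that two edges are "connected" when the first edge ends at the vertex where the second one starts, within the same graph.

```python
from collections import Counter


def build_markov_tables(vertices, edges):
    """Build the three frequency summaries.

    vertices: iterable of (graphID, vertexID, vertexLabel)
    edges:    iterable of (graphID, sVertex, dVertex, edgeLabel)
    Returns (vertex_freq, edge_freq, connection_freq).
    """
    edges = list(edges)
    vertex_freq = Counter(label for _, _, label in vertices)
    edge_freq = Counter(label for _, _, _, label in edges)

    # Index the labels of outgoing edges by (graph, source vertex).
    outgoing = {}
    for gid, s, _, label in edges:
        outgoing.setdefault((gid, s), []).append(label)

    # Assumption: count a connection (l1, l2) whenever an l1-edge ends
    # at the vertex where an l2-edge starts.
    connection_freq = Counter()
    for gid, _, d, label in edges:
        for next_label in outgoing.get((gid, d), []):
            connection_freq[(label, next_label)] += 1
    return vertex_freq, edge_freq, connection_freq


# Hypothetical encoded data for two small graphs.
demo_vertices = [(1, 1, "A"), (1, 2, "C"), (1, 3, "D"), (2, 1, "A")]
demo_edges = [(1, 1, 2, "e"), (1, 2, 3, "m"), (2, 1, 2, "e")]
v_freq, e_freq, c_freq = build_markov_tables(demo_vertices, demo_edges)
```

Connections are keyed by a pair of labels rather than a concatenated string so that multi-character labels stay unambiguous.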

Decomposition-Based and Selectivity-Aware SQL Translation

Identifying the pruning points.
Calculating the number of partitions.
Decomposed SQL translation:
  Blindly Single-Level Decomposition.
  Pruned Single-Level Decomposition.
  Pruned Multi-Level Decomposition.
Selectivity-aware Annotations.


Decomposition-Based and Selectivity-Aware SQL Translation

Identifying the pruning points
Each vertex label, edge label or edge connection with a low frequency is considered a pruning point in our relational evaluation mechanism.
Given a query graph q, we first check the structure of q against our summary Markov tables to identify the possible pruning points. The number of identified pruning points is denoted NPP.

Calculating the number of partitions
A sub-graph query q requires NJP join operations.
Assume that the relational query engine can evaluate up to MJP join operations in one query.
The number of partitions (NOP) is then computed as NOP = NJP / MJP.
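Both quantities are simple to compute once the summary tables exist. In the sketch below, the frequency threshold that separates "low" from "high" frequency and the rounding up of NJP / MJP to a whole number of partitions are assumptions of this example; the slides do not specify either. The sample frequencies are taken from the vertex-label Markov table.

```python
import math


def pruning_points(query_labels, frequency, threshold):
    """Return the query labels whose summary frequency is below threshold.

    A label missing from the summary table has frequency 0 and is
    therefore also reported as a pruning point.
    """
    return [label for label in query_labels
            if frequency.get(label, 0) < threshold]


def number_of_partitions(njp, mjp):
    """NOP = NJP / MJP, rounded up (rounding is an assumption)."""
    return math.ceil(njp / mjp)


# Frequencies from the vertex-label Markov table.
vertex_freq = {"A": 100, "B": 200, "C": 38, "D": 4, "E": 50, "L": 6,
               "M": 10, "N": 250, "O": 3, "P": 40, "R": 55}

points = pruning_points(["A", "D", "N", "O"], vertex_freq, threshold=10)
npp = len(points)                              # number of pruning points
nop = number_of_partitions(njp=12, mjp=5)      # hypothetical NJP and MJP
```

With a threshold of 10, the labels D and O qualify as pruning points, while the frequent labels A and N do not.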


Decomposition-Based and Selectivity-Aware SQL Translation

Blindly Single-Level Decomposition
If NPP = 0, we blindly decompose the query q into NOP partitions.
Each partition is translated into an intermediate evaluation step Si.
The final evaluation step (FES) joins all intermediate evaluation steps and adds the conjunctive conditions of the partition connectors.

Pruned Single-Level Decomposition
If NPP >= NOP, we distribute the pruning points across the NOP intermediate partitions. This ensures balanced, effective pruning of all intermediate results.

Pruned Multi-Level Decomposition
If NPP < NOP, we distribute the pruning points across a first level of intermediate results of NOP partitions.
An intermediate collective pruned step IPS is constructed by joining all the pruned first-level intermediate results.
IPS is used as an entry pruning point for the remaining (NOP − NPP) non-pruned partitions in a hierarchical multi-level fashion.
Each pruning point can be used to prune more than one partition (if possible).
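The choice among the three strategies depends only on NPP and NOP, which the following sketch makes explicit. The function names, the returned strings and the round-robin distribution of pruning points are illustrative assumptions; the slides do not prescribe how the points are assigned to partitions.

```python
def choose_decomposition(npp, nop):
    """Select the decomposition strategy from NPP and NOP."""
    if npp == 0:
        return "blind single-level"
    if npp >= nop:
        return "pruned single-level"
    return "pruned multi-level"      # 0 < NPP < NOP


def distribute(points, nop):
    """Assign pruning points to NOP partitions in round-robin order.

    When there are fewer points than partitions, the trailing partitions
    stay empty; these are the non-pruned partitions that the collective
    pruned step (IPS) would feed in the multi-level case.
    """
    partitions = [[] for _ in range(nop)]
    for i, point in enumerate(points):
        partitions[i % nop].append(point)
    return partitions


strategy = choose_decomposition(npp=2, nop=3)
assignment = distribute(["D", "O"], nop=3)
```

Round-robin assignment keeps the number of pruning points per partition within one of each other, which is one simple way to aim for the balanced pruning described above.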


Decomposition-Based and Selectivity-Aware SQL Translation

Figure: Selectivity-aware decomposition process. (a) NPP > NOP, with partitions S1 and S2; (b) NPP < NOP, with partitions S1, S2 and S3. Each panel maps the partitions to their SQL steps (S1 SQL, S2 SQL, S3 SQL) and to the final step (FES SQL).

Decomposition-Based and Selectivity-Aware SQL Translation

Selectivity-aware Annotations
For any given SQL query, there are a large number of alternative execution plans. These alternative plans may differ significantly in their use of system resources or in their response time.
We use the statistical summary information to give hints to the query optimizer by injecting additional selectivity information for the individual query predicates into the SQL translations of the graph queries:

SELECT fieldlist FROM tablelist WHERE Pi SELECTIVITY Si
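A selectivity value for a label predicate can be estimated from the Markov tables as the label's frequency divided by the total frequency. The sketch below only builds the annotated WHERE clause as a string; the function name, the estimation formula and the small lower bound on the value are assumptions of this example, and an engine that does not support a SELECTIVITY clause will reject the resulting SQL.

```python
def annotate(predicates, frequency, total, floor=1e-6):
    """Append a SELECTIVITY clause to each label predicate.

    predicates: list of (sql_predicate, label) pairs
    frequency:  {label: frequency} from a Markov table
    total:      total number of rows the frequencies refer to
    floor:      lower bound for the estimate (an assumption, so that
                unseen labels do not produce a selectivity of zero)
    """
    parts = []
    for sql, label in predicates:
        selectivity = max(frequency.get(label, 0) / total, floor)
        parts.append(f"{sql} SELECTIVITY {selectivity:.6f}")
    return " AND ".join(parts)


# Hypothetical frequencies: label D occurs in 4 of 800 vertex rows.
where_clause = annotate(
    [("V1.vertexLabel = 'D'", "D"), ("V2.vertexLabel = 'A'", "A")],
    {"A": 100, "D": 4},
    total=800,
)
print("SELECT DISTINCT V1.graphID FROM Vertices AS V1, Vertices AS V2 WHERE "
      + where_clause)
```

The annotation does not change the query result; it only passes the estimate to the optimizer in place of the optimizer's own guess.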


Experimental Results: Performance and Scalability

Figure: The scalability of GraphREL. (a) Synthetic Dataset (series D2kV10E20L40M50, D10kV10E20L40M50, D50kV30E40L90M150, D100kV30E40L90M150); (b) DBLP Dataset (series 1MB, 10MB, 50MB, 100MB). Both panels plot Execution Time (ms) on a logarithmic scale against Query Size (Q4, Q8, Q12, Q16, Q20).

Experimental Results: The effect of using Partitioned B-tree Indexes and Selectivity Injections

Figure: The speedup improvement for the relational evaluation of sub-graph queries using partitioned B-tree indexes and selectivity-aware annotations. (a) Partitioned B-tree indexes; (b) Injection of selectivity annotations. Both panels show the Synthetic and DBLP datasets against Query Size (Q4, Q8, Q12, Q16, Q20); the vertical axes are labelled Execution Times (ms) and Percentage of Improvement (%).

SQLBP: An Application of GraphREL

Many of today's information systems are driven by explicit process models.

A business process is a set of coordinated activities that achieve a specific business objective.

With the rapid growth in the number of process models, it becomes crucial for business process designers to be able to search their repository for models efficiently.

SQLBP is a query processor for business process models.

SQLBP is based on a new visual query language for business processes called BPMN-Q. The language addresses process definitions and extends the standard BPMN notation for modeling business processes for its concrete syntax.

A BPMN-Q query is treated as a graph that is matched against the process graph(s).


SQLBP: An Application of GraphREL

Figure (a): BPMN-Q Elements.

Figure (b): Example of a BPMN-Q query. (a) A process model with nodes A, B, C, D and E; (b) a query with a path element (//) connecting nodes B and D; (c) a sub-graph of the process in (a), consisting of B, C and D, that matches the query in (b).

SQLBP: An Application of GraphREL

Figure: SQLBP architecture. Components: BPMN-Q Query Editor, Model Editor, SQL-Based Query Processor, Translation Middleware (BPEL, XLANG, EPC, ...), and an RDBMS hosting the Relational Business Process Repository. Labelled flows: BPMN-Q query (GraphML), GraphML for display, Query Results, Updates, SQL Script.

Conclusions

GraphREL is a purely relational framework for storing and querying graph data. In principle, GraphREL has the following advantages:

It can reside on any relational database system and exploit its well-known, mature query optimization techniques as well as its efficient and scalable query processing techniques.

It requires no time cost for offline or pre-processing steps.

It can handle both static and dynamic (frequently updated) graph databases very well.

The selectivity annotations in the SQL evaluation scripts enable the relational query optimizer to select the most efficient execution plans and to prune unneeded graph database members efficiently.



The End

Thank You
