Sakr S, Al-Naymat G. Efficient relational techniques for processing graph queries. JOURNAL OF COMPUTER SCIENCE AND TECHNOLOGY 25(6): 1237–1255 Nov. 2010. DOI 10.1007/s11390-010-1098-z
Efficient Relational Techniques for Processing Graph Queries

Sherif Sakr 1,2 (Member, ACM) and Ghazi Al-Naymat 1 (Member, ACM)

1 School of Computer Science and Engineering, University of New South Wales, Sydney, Australia
2 Managing Complexity Group, National ICT Australia (NICTA), ATP, Sydney, Australia

E-mail: {ssakr, ghazi}@cse.unsw.edu.au

Received February 22, 2010; revised August 19, 2010.

Abstract: Graphs are widely used for modeling complicated data such as social networks, chemical compounds, protein interactions and the semantic web. To effectively understand and utilize any collection of graphs, a graph database that efficiently supports elementary querying mechanisms is crucially required. For example, subgraph and supergraph queries are important types of graph queries with many practical applications. A primary challenge in computing the answers of graph queries is that pair-wise comparisons of graphs are usually hard problems. Relational database management systems (RDBMSs) have repeatedly been shown to be able to efficiently host different types of data such as complex objects and XML data. RDBMSs derive much of their performance from sophisticated optimizer components which make use of physical properties that are specific to the relational model such as sortedness, proper join ordering and powerful indexing mechanisms. In this article, we study the problem of indexing and querying graph databases using the relational infrastructure. We present a purely relational framework for processing graph queries. This framework relies on building a layer of graph features knowledge which captures metadata and summary features of the underlying graph database. We describe different querying mechanisms which make use of the layer of graph features knowledge to achieve scalable performance for processing graph queries. Finally, we conduct an extensive set of experiments on real and synthetic datasets to demonstrate the efficiency and the scalability of our techniques.

Keywords:
graph database, graph query, subgraph query, supergraph query

1 Introduction
The graph is a powerful tool for representing and understanding objects and their relationships in various application domains. Recently, graphs have been widely used to model complex structured and schemaless data such as the semantic web[1], social networks[2], biological networks[3], protein networks[4], chemical compounds[5-6] and business process models[7]. The growing popularity of graph databases has generated interesting data management problems, and graph querying problems have attracted a lot of attention from the database community: subgraph search[8-14], supergraph search[15-16], approximate subgraph matching[17-18] and correlation subgraph query[19-20]. In principle, retrieving related graphs containing a query graph from a large graph database is a key performance issue in any graph-based application. Therefore, it is important to develop efficient algorithms for processing queries on graph databases. Among the many graph-based search operations, in this article we focus on the following two main types of graph queries which have generated
a lot of interest in practical applications:
1) Subgraph Queries. This category searches for a specific pattern in the graph database. Formally, given a graph database D = {g1, g2, ..., gn} and a subgraph query q, the answer set A = {gi | q ⊆ gi, gi ∈ D}.
2) Supergraph Queries. This category searches for graphs with structures that are fully contained in the input query. Formally, given a graph database D = {g1, g2, ..., gn} and a supergraph query q, the answer set A = {gi | q ⊇ gi, gi ∈ D}.
Clearly, performing a sequential scan over the entire graph database D to check whether each graph database member gi belongs to the answer set of a graph query q is an inefficient and very time-consuming task. Hence, there is a clear need to build graph indices in order to improve the performance of processing graph queries. Relational database management systems (RDBMSs) have repeatedly shown that they are very efficient, scalable and successful in hosting types of data which had formerly not been anticipated to be stored inside relational databases, such as complex objects[21], spatio-temporal data[22] and XML data[23]. In addition,
Regular Paper ©2010 Springer Science + Business Media, LLC & Science Press, China
RDBMSs have shown their ability to handle vast amounts of data very efficiently using their powerful indexing mechanisms. In this article we focus on employing the powerful features of the relational infrastructure to implement efficient mechanisms for processing graph queries. In particular, we present a purely relational framework to speed up search efficiency in the context of graph databases. In our approach, the graph database is first encoded using an intuitive fixed relational mapping scheme, after which the graph query is translated into a sequence of SQL evaluation steps over the defined storage scheme. An obvious problem in the relational evaluation approach of graph queries is the large cost which may result from the large number of join operations required between the encoding relations. In order to overcome this problem, we exploit an observation from our previous work, namely that the size of the intermediate results dramatically affects the overall evaluation performance of SQL scripts[24-26]. Hence, we construct a metadata and summary layer of graph features knowledge which consists of three main components:
1) a descriptor record for the main features of each graph database member;
2) an aggregated summary of the unique features of the entire set of graph database members;
3) an index structure of the most pruning (least frequently occurring) nodes and edges in the underlying graph database.
Using this layer of graph features knowledge, we can apply effective and efficient pruning algorithms to filter out as many as possible of the false positive graphs (graphs guaranteed not to appear in the final results of the graph query) before passing the candidate result set to a light-weight verification process.
Moreover, we attempt to achieve the maximum performance improvement for our relational execution plans in two main ways: a) utilizing the existing powerful partitioned B-tree indexing mechanism[27] to reduce secondary storage access costs to a minimum[28]; b) using the information of the graph features knowledge to influence the relational query optimizer to select the most efficient join order and the cheapest execution plan[29], by providing selectivity annotations of the translated query predicates that reduce the size of the intermediate results to a minimum. We summarize the main contributions of this article as follows.
• We present a purely relational framework for evaluating graph queries. In this framework, we encode
graph databases using a fixed relational schema and translate graph queries into standard SQL scripts.
• We describe an approach for building a metadata and summary layer about the underlying graph database, called the graph features knowledge layer. This layer extracts the main features of each graph database member and keeps information about the most pruning features in the graph database.
• We describe efficient mechanisms for processing subgraph and supergraph queries. These mechanisms use the information of the graph features knowledge layer to apply effective and efficient pruning strategies in order to reduce the search space to a minimum.
• We exploit the powerful features of the relational infrastructure, such as partitioned B-tree indexes and selectivity annotations, to reduce the secondary storage access costs and the size of the intermediate results of our SQL scripts to a minimum.
• We empirically evaluate the efficiency and the scalability of our techniques through an extensive set of experiments.
The remainder of this article is organized as follows. Section 2 defines the preliminary concepts used in this article. Section 3 discusses different relational encoding schemes for storing graph databases. Section 4 presents the components of the layer of graph features knowledge. Sections 5 and 6 present our mechanisms for utilizing the graph features knowledge to achieve efficient relational processing of subgraph and supergraph queries respectively. Section 7 describes our decomposition mechanism to avoid the complexity of the generated SQL scripts for large graph queries. We demonstrate the efficiency and scalability of our mechanisms through an extensive set of experiments described in Section 8. Related work is surveyed in Section 9 before we finally conclude the article in Section 10.
2 Preliminaries
This section introduces the fundamental definitions used in this article and gives the formal problem statement. Definition 1 (Labeled Graph). A labeled graph g is denoted as (V, E, Lv , Le , Fv , Fe ) where V is the set of vertices; E ⊆ V × V is the set of edges joining two distinct vertices; Lv is the set of vertex labels; Le is the set of edge labels; Fv is a function V → Lv that assigns labels to vertices and Fe is a function E → Le that assigns labels to edges. In principle, labeled graphs are classified according to the direction of their edges into two main classes: 1) Directed-labeled graphs such as XML and RDF. 2) Undirected-labeled graphs such as social networks and
chemical compounds. Generally, the following inequalities usually hold: |V| ≥ |Lv| and |E| ≥ |Le|. This article focuses on directed-labeled and connected simple graphs (we refer to them as graphs in the rest of the article). However, the algorithms proposed in the article can be easily extended to other kinds of graphs.
Definition 2 (Graph Database). A graph database D is a collection of member graphs gi where D = {g1, g2, ..., gn}. In principle, there are two main types of graph databases. The first type consists of a small number of very large graphs such as the Web graph and social networks. The second type consists of a large set of small graphs such as chemical compounds and biological pathways. The main focus of this article is the second type of graph databases.
Definition 3 (Subgraph Isomorphism). Let g1 = (V1, E1, Lv1, Le1, Fv1, Fe1) and g2 = (V2, E2, Lv2, Le2, Fv2, Fe2) be two graphs. g1 is subgraph isomorphic to g2 if and only if there exists at least one injective function f : V1 → V2 such that: 1) for any edge uv ∈ E1, there is an edge f(u)f(v) ∈ E2; 2) Fv1(u) = Fv2(f(u)) and Fv1(v) = Fv2(f(v)); 3) Fe1(uv) = Fe2(f(u)f(v)).
The subgraph search and supergraph search problems in graph databases are formulated as follows.
Definition 4 (Subgraph Search). Given a graph database D = {g1, g2, ..., gn} and a subgraph query q, the query answer set A = {gi | q ⊆ gi, gi ∈ D}.
Definition 5 (Supergraph Search). Given a graph database D = {g1, g2, ..., gn} and a supergraph query q, the query answer set A = {gi | q ⊇ gi, gi ∈ D}.
Fig.1 provides illustrative examples of the graph database querying operations. Fig.1(a) describes the
subgraph search problem while Fig.1(b) describes the supergraph search problem. These two problems are the main focus of this article.
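For intuition, the containment test of Definition 3 can be sketched as a naive backtracking search. The following Python sketch is only an illustrative baseline under an assumed in-memory representation (vertex-label and edge-label dictionaries); it is not the paper's verification procedure, which operates over the relational encoding.

```python
def is_subgraph_isomorphic(q, g):
    """Naive backtracking test: does q embed in g per Definition 3?
    Each graph is a pair (vertex_labels, edge_labels) where vertex_labels
    maps vertexID -> label and edge_labels maps (source, dest) -> label."""
    qv, qe = q
    gv, ge = g
    qnodes = list(qv)

    def extend(mapping):
        if len(mapping) == len(qnodes):
            return True
        u = qnodes[len(mapping)]
        for cand in gv:
            # enforce injectivity and the vertex-label condition
            if cand in mapping.values() or gv[cand] != qv[u]:
                continue
            mapping[u] = cand
            # every query edge with both endpoints mapped must exist
            # in g with the same edge label
            ok = all(
                ge.get((mapping[a], mapping[b])) == lbl
                for (a, b), lbl in qe.items()
                if a in mapping and b in mapping
            )
            if ok and extend(mapping):
                return True
            del mapping[u]
        return False

    return extend({})
```

Even this tiny sketch makes the cost of pair-wise comparison visible: the search is exponential in the worst case, which is exactly why the filtering phases described later aim to invoke verification on as few candidates as possible.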
3 Relational Encoding of Graph Databases
The starting point of any relational framework for processing graph queries is to select a suitable encoding for each graph member gi in the graph database D. In principle, several relational encodings can be used to store graph-structured databases[30-31]. In this section, we discuss two alternative mapping schemes that can be used in our framework and compare their advantages and disadvantages. However, the graph query processing algorithms of this article are independent of the underlying relational schema and can be adapted to deal with any encoding.
3.1 Vertex-Edge Mapping Scheme
The first alternative relational mapping scheme for storing graph databases is the Vertex-Edge mapping scheme. It is a simple and intuitive encoding scheme where each graph database member gi is assigned a unique identifier graphID and each vertex is assigned a sequence number (vertexID) inside its graph. Every vertex is represented by one tuple in a single table (Vertices table) which stores all vertices of the graph database. Each vertex is identified by the graphID to which it belongs and its vertexID, and carries an additional attribute to store the vertex label. Similarly, all edges of the graph database are stored in a single table (Edges table) where each edge is represented by a single tuple. Each edge tuple describes the graph database member to which the edge belongs, the ID of the source vertex of the edge, the ID of the destination vertex of the edge and the edge label. Therefore, the relational storage scheme of our Vertex-Edge mapping is described as follows:
• Vertices(graphID, vertexID, vertexLabel),
• Edges(graphID, sVertex, dVertex, edgeLabel).
Fig.2(a) illustrates an example of the Vertex-Edge relational mapping scheme of a sample graph database.
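To make the Vertex-Edge scheme concrete, here is a minimal sketch using Python's built-in sqlite3 module. The toy two-graph database and the one-edge subgraph query are our own illustration, not data from the paper:

```python
import sqlite3

# A toy two-graph database in the Vertex-Edge scheme.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE Vertices (graphID INT, vertexID INT, vertexLabel TEXT);
    CREATE TABLE Edges    (graphID INT, sVertex INT, dVertex INT, edgeLabel TEXT);
""")
# g1: A -x-> B -y-> C        g2: A -x-> B
conn.executemany("INSERT INTO Vertices VALUES (?,?,?)", [
    (1, 1, 'A'), (1, 2, 'B'), (1, 3, 'C'),
    (2, 1, 'A'), (2, 2, 'B'),
])
conn.executemany("INSERT INTO Edges VALUES (?,?,?,?)", [
    (1, 1, 2, 'x'), (1, 2, 3, 'y'),
    (2, 1, 2, 'x'),
])

# One-edge subgraph query: which graphs contain A -x-> B?
rows = conn.execute("""
    SELECT DISTINCT V1.graphID
    FROM Vertices V1, Vertices V2, Edges E1
    WHERE V1.graphID = V2.graphID AND V1.graphID = E1.graphID
      AND V1.vertexLabel = 'A' AND V2.vertexLabel = 'B'
      AND E1.edgeLabel = 'x'
      AND E1.sVertex = V1.vertexID AND E1.dVertex = V2.vertexID
    ORDER BY V1.graphID""").fetchall()
```

Note that even this one-edge query already joins two instances of the Vertices table with the Edges table, which anticipates the join-count comparison discussed below.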
Fig.1. Graph database querying. (a) Subgraph search problem. (b) Supergraph search problem.
3.2 Edge-Edge Mapping Scheme
The second alternative relational mapping scheme for storing graph databases is the Edge-Edge mapping scheme. This scheme starts by assigning a unique identifier for each graph database member, vertex and edge in the graph database. Each edge in the graph database is represented by a single tuple in the encoding relation. In addition to the ID of the graph member, each tuple stores the IDs and the labels of the source
and destination vertices. Hence, the relational storage scheme of the Edge-Edge mapping is described as follows: • EdgeEdge (gID, eID, eLabel , sVID, sVLabel , dVID, dVLabel ). Fig.2(b) illustrates an example of the Edge-Edge relational mapping scheme of graph databases.
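As a minimal sketch of the denormalized encoding (again with sqlite3 and illustrative toy data of our own), the same one-edge pattern can be answered from the single EdgeEdge table without any join:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("""CREATE TABLE EdgeEdge
    (gID INT, eID INT, eLabel TEXT,
     sVID INT, sVLabel TEXT, dVID INT, dVLabel TEXT)""")
# Two toy graphs, denormalized: one row per edge, with both
# endpoint labels repeated inline in the edge tuple.
conn.executemany("INSERT INTO EdgeEdge VALUES (?,?,?,?,?,?,?)", [
    (1, 1, 'x', 1, 'A', 2, 'B'),
    (1, 2, 'y', 2, 'B', 3, 'C'),
    (2, 1, 'x', 1, 'A', 2, 'B'),
])

# A two-vertex/one-edge query needs only a single table lookup:
rows = conn.execute("""
    SELECT DISTINCT gID FROM EdgeEdge
    WHERE sVLabel = 'A' AND eLabel = 'x' AND dVLabel = 'B'
    ORDER BY gID""").fetchall()
```

The price of this convenience is the repetition of every vertex label in each of its incident edge tuples, which is the source of the update anomalies discussed next.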
Fig.2. Relational encoding of graph databases. (a) Vertex-Edge mapping scheme. (b) Edge-Edge mapping scheme.

3.3 Comparison and Discussion

In principle, the Edge-Edge scheme can be considered a denormalized version of the Vertex-Edge scheme. It groups the information of all vertices and edges into one relation. Therefore, on the one hand, the Edge-Edge scheme is more efficient for querying purposes. For example, retrieving a subgraph which consists of two nodes and an edge connecting them requires a single lookup in the EdgeEdge table, while in the Vertex-Edge scheme it requires joining the Vertices table with the Edges table. In general, assume that we would like to retrieve a graph q that consists of m vertices and n edges. Using the Vertex-Edge scheme, the number of joins between the encoding relations is equal to m + n, whereas using the Edge-Edge scheme it is reduced to only n join operations. On the other hand, the Vertex-Edge scheme is more suitable for dynamic (frequently updated) graph databases. It is more efficient to add, delete or update the structure of existing graphs with no special processing, while the Edge-Edge scheme suffers from the update anomalies problem as a result of the duplication of the label information of each vertex in the tuple of every edge to which it belongs.

In the remainder of this article, we base the discussion of our algorithms on the Vertex-Edge mapping scheme. However, in Subsection 8.2 we compare the performance of the query processing algorithms using the two alternative storage schemes.

4 Graph Features Knowledge

In order to improve the efficiency of the SQL-based approach for evaluating graph queries, we follow the idea of automatic creation and usage of higher-level data about data (metadata)[32-34], here referred to as graph features knowledge (GFK). We construct a layer of summary data which consists of three main components: graph descriptors, aggregated graph descriptors and Nodes-Edges Markov summaries. We describe these three components in this section. In the subsequent sections, we describe how to use these components to improve the performance of graph query processing by effectively pruning the search space, reducing the size of intermediate results and influencing the relational query optimizer to select the most efficient join order and the cheapest execution plan.
4.1 Graph Descriptors
This component maintains a descriptor record about the features of each member of the graph database. Each graph descriptor includes the following components:
• No. Nodes (NN): the total number of nodes in the graph;
• No. Edges (NE): the total number of edges in the graph;
• Unique Node Labels (NL): the number of unique node labels in the graph;
• Unique Edge Labels (EL): the number of unique edge labels in the graph;
• Maximum In-node Degree (MID): the maximum number of incoming edges for a node in the graph;
• Maximum Out-node Degree (MOD): the maximum number of outgoing edges for a node in the graph;
• Maximum io-node Degree (MD): the maximum total number of incoming and outgoing edges for a node in the graph;
• Maximum Path Length (MPL): the length of the longest path in the graph, where a path is a sequence of consecutive edges and its length is the number of edges traversed.
Fig.3 illustrates an example of a graph descriptor table for a sample graph database. In principle, the components of the graph descriptor can vary from one database to another based on the characteristics of the underlying graph database members. For example, we can avoid storing the components which store the number of unique node and edge labels if there is a guarantee that each node or edge in a graph has a unique label. However, we can add some more descriptive fields such as the number of occurrences of the maximum path length or the number of occurrences of the maximum (in/out/io)-node degrees.
Fig.3. Sample graph descriptors table.
On one hand, the more features we index, the higher the pruning power of the filtering phase, but also the higher the space cost. On the other hand, indexing fewer features leads to poorer pruning power in the filtering process and consequently a performance hit in query processing time. In general, computing the components of our graph descriptor (except the maximum path length, which is recursive in nature) can be achieved in a straightforward way using SQL aggregate queries. For example, computing the number-of-nodes graph descriptor component (NN) can be expressed using the following SQL query:
SELECT graphID, COUNT(*) AS NN
FROM Vertices
GROUP BY graphID;

Similarly, computing the number-of-edges descriptor component (NE) can be expressed using the following SQL query:

SELECT graphID, COUNT(*) AS NE
FROM Edges
GROUP BY graphID;
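The label and degree components follow the same aggregate pattern. A small sketch (sqlite3 with toy data of our own, not the paper's implementation) for the NL and MOD components:

```python
import sqlite3

# Toy single-graph database in the Vertex-Edge scheme (illustrative data).
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE Vertices (graphID INT, vertexID INT, vertexLabel TEXT);
    CREATE TABLE Edges    (graphID INT, sVertex INT, dVertex INT, edgeLabel TEXT);
""")
conn.executemany("INSERT INTO Vertices VALUES (?,?,?)",
                 [(1, 1, 'A'), (1, 2, 'B'), (1, 3, 'B')])
conn.executemany("INSERT INTO Edges VALUES (?,?,?,?)",
                 [(1, 1, 2, 'x'), (1, 1, 3, 'x'), (1, 2, 3, 'y')])

# NL: number of distinct vertex labels per graph.
nl = conn.execute("""SELECT graphID, COUNT(DISTINCT vertexLabel) AS NL
                     FROM Vertices GROUP BY graphID""").fetchall()

# MOD: maximum out-degree, i.e., the largest per-source edge count
# within each graph, computed over a grouped subquery.
mod = conn.execute("""SELECT graphID, MAX(c) AS MOD
                      FROM (SELECT graphID, sVertex, COUNT(*) AS c
                            FROM Edges GROUP BY graphID, sVertex)
                      GROUP BY graphID""").fetchall()
```

MID and MD can be computed analogously by grouping on the destination vertex, and on both endpoints combined, respectively.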
4.2 Aggregated Graph Descriptors
This component represents a compact aggregated version of the graph descriptors in the following form:
• (BID, NN, NE, NL, EL, MID, MOD, MD, MPL, graphCount).
It groups graphs with the same features into one bucket identified by a bucket ID (BID). Each bucket has a graphCount attribute which represents the number of graph database members that share the features identified by the values of the bucket components. Therefore, the aggregated graph descriptors component can be considered a summary guide to the features of the graph database members[35]. The size of this aggregated version is usually, by far, smaller than the size of the descriptor table and can easily fit into main memory. Based on a defined threshold t, we create an auxiliary inverted list[36] of all the related graph database members for each summary bucket whose number of graph database members (graphCount) is less than t. The main idea of these inverted lists is to allow cheap and quick access to the set of the few graphs which share the same features of that bucket.
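A minimal in-memory sketch of how such buckets and their inverted lists might be built (the function name and dictionary layout are our own assumptions, not the paper's code):

```python
from collections import defaultdict

def aggregate_descriptors(descriptors, t):
    """Group per-graph descriptor tuples into buckets of identical feature
    values; attach an inverted list of graph IDs to buckets smaller than t.
    `descriptors` maps graphID -> feature tuple (NN, NE, NL, EL, ...)."""
    buckets = defaultdict(list)
    for gid, features in descriptors.items():
        buckets[features].append(gid)
    agg = []
    for bid, (features, gids) in enumerate(sorted(buckets.items())):
        entry = {"BID": bid, "features": features, "graphCount": len(gids)}
        if len(gids) < t:
            # cheap direct access to the few graphs sharing these features
            entry["invertedList"] = gids
        agg.append(entry)
    return agg
```

For example, three graphs of which two share identical descriptors collapse into two buckets, and only the singleton bucket carries an inverted list when t = 2.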
4.3 Nodes-Edges Markov Summaries
This component uses three Markov tables to store information about the frequency of occurrence of the distinct labels of vertices, the distinct labels of edges and the connections between pairs of vertices (edges) in the whole underlying graph database[37]. Fig.4 presents examples of these Markov tables. Due to the requirements of our query processing algorithms (Sections 5, 6 and 7), we are only interested in vertex and edge information with low frequency. Therefore, we summarize these Markov tables by deleting high-frequency tuples up to a defined threshold freq_t for each table. This threshold value is defined according to the available memory budget. Moreover, for any vertex label, edge label or pair of vertices which appears in a number of graph database members smaller than another predefined threshold gct, we create an inverted list of all its related graphs.

Fig.4. Sample nodes-edges Markov summaries.
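The construction of these frequency tables can be sketched as follows (an illustration under an assumed in-memory graph representation; the function name is hypothetical):

```python
from collections import Counter

def build_markov_summaries(db, freq_t):
    """db maps graphID -> (vertex_labels, edge_list) where vertex_labels
    maps vertexID -> label and edge_list holds (src, dst, edgeLabel)
    triples. Returns the vertex-label, edge-label and label-pair
    frequency tables, keeping only low-frequency entries (< freq_t)."""
    vfreq, efreq, pfreq = Counter(), Counter(), Counter()
    for vlabels, edges in db.values():
        vfreq.update(vlabels.values())
        for s, d, elabel in edges:
            efreq[elabel] += 1
            # connection frequency of the (source label, dest label) pair
            pfreq[(vlabels[s], vlabels[d])] += 1
    prune = lambda c: {k: v for k, v in c.items() if v < freq_t}
    return prune(vfreq), prune(efreq), prune(pfreq)
```

Only the surviving low-frequency entries are useful for pruning: a query feature that is rare in the database narrows the candidate set sharply, whereas a ubiquitous feature filters out almost nothing.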
5 Relational Processing of Subgraph Search Queries
In this section, we present our query processing and optimization techniques for subgraph queries. Subsection 5.1 presents our framework for answering graph queries. Subsection 5.2 describes the general case of the SQL-based evaluation of subgraph queries. Subsection 5.3 shows how the components of the layer of graph features knowledge are used to effectively improve query performance. Subsection 5.5 describes how the powerful features of the relational indexing infrastructure are used to further optimize the features-based evaluation and boost query processing performance.
5.1 Query Processing Framework
Fig.5 depicts an overview of our relational framework for processing graph queries. The processing of graph queries in our approach goes through the following main steps:
1) Database Preprocessing. In this step we analyze the underlying graph database to build the components of the layer of graph features knowledge (GFK).
2) Determining the Pruning Strategy. The query processor recognizes the features of the input graph query and probes the components of the GFK with the query features to identify the pruning strategy of the filtering phase.
3) Filtering. This step uses the outcome decisions from probing the GFK to apply an effective and efficient pruning strategy that filters the graph database members, discards many false positives (graphs that cannot belong to the query answer) and produces a list of graph database members that are candidates to constitute the query result.
4) Verification. This step validates each candidate graph to check whether it really belongs to the answer set of the query graph.
Fig.5. Framework of processing graph queries.
Clearly, the query response time consists of the filtering time and the verification time. The fewer candidates that remain after filtering, the faster the query response time. Hence, the pruning power of the filtering process is critical to the overall query performance.
5.2 Subgraph Query Evaluation: General Case
Let us assume that we have a subgraph query q consisting of a set of vertices QV of size m and a set of edges QE of size n. Using the Vertex-Edge mapping scheme (Subsection 3.1), we employ the following SQL-based filtering-and-verification mechanism to compute the answer set of such queries.

Filtering Phase. In this phase we specify the set of graph database members which contain the set of vertices and edges that describe the subgraph query, using the following SQL template:

1  SELECT DISTINCT V1.graphID
2  FROM Vertices as V1, ..., Vertices as Vm,
3       Edges as E1, ..., Edges as En
4  WHERE
5  ∀ i=2..m (V1.graphID = Vi.graphID)
6  AND ∀ j=1..n (V1.graphID = Ej.graphID)
7  AND ∀ i=1..m (Vi.vertexLabel = QVi.vertexLabel)
8  AND ∀ j=1..n (Ej.edgeLabel = QEj.edgeLabel)
9  AND ∀ j=1..n (Ej.sVertex = Vf.vertexID AND
10      Ej.dVertex = Vf.vertexID);
where each referenced table Vi (Line 2) represents an instance from the table Vertices and maps the information of one vertex of the set of subgraph query vertices QV. Similarly, each referenced table Ej (Line 3) represents an instance from the table Edges and maps the information of one edge of the set of subgraph query edges QE. f is the mapping function between each vertex of QV and its associated vertices table instance Vi .
Line 5 represents a set of m − 1 conjunctive conditions to ensure that all queried vertices belong to the same graph. Similarly, Line 6 represents a set of n conjunctive conditions to ensure that all queried edges belong to the same graph as the queried vertices. Lines 7 and 8 represent the sets of conjunctive predicates over the vertex and edge labels respectively. Lines 9 and 10 represent the topology and connection information between the mapped vertices.

Verification Phase. This phase is optional. We apply the verification process only if more than one vertex of the set of query vertices QV has the same label. In this case we need to verify that each vertex in the set of filtered vertices for each candidate graph database member gi is distinct, which can be easily achieved using their vertexIDs. Although the conditions of the verification process could be injected into the SQL translation template of the filtering phase, we found it more efficient to avoid the cost of evaluating these conditions over every graph database member gi by delaying their processing, if required, to a separate phase after pruning the candidate list.
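The template can be instantiated mechanically for a concrete query graph. The following sketch (a hypothetical helper of our own, not the authors' code) emits the filtering SQL for a query given as a list of vertex labels and a list of labeled edges over vertex indices:

```python
def subgraph_filter_sql(qvertices, qedges):
    """Emit the filtering SQL of the template above for a concrete query.
    qvertices: list of vertex labels (position = query vertex index);
    qedges: list of (src_index, dst_index, edgeLabel) triples."""
    m, n = len(qvertices), len(qedges)
    tables = ([f"Vertices AS V{i+1}" for i in range(m)] +
              [f"Edges AS E{j+1}" for j in range(n)])
    # same-graph conditions (template lines 5 and 6)
    preds = [f"V1.graphID = V{i+1}.graphID" for i in range(1, m)]
    preds += [f"V1.graphID = E{j+1}.graphID" for j in range(n)]
    # label predicates (template lines 7 and 8)
    preds += [f"V{i+1}.vertexLabel = '{lbl}'"
              for i, lbl in enumerate(qvertices)]
    # topology predicates (template lines 9 and 10)
    for j, (s, d, lbl) in enumerate(qedges):
        preds += [f"E{j+1}.edgeLabel = '{lbl}'",
                  f"E{j+1}.sVertex = V{s+1}.vertexID",
                  f"E{j+1}.dVertex = V{d+1}.vertexID"]
    return ("SELECT DISTINCT V1.graphID\nFROM " + ", ".join(tables) +
            "\nWHERE " + "\n  AND ".join(preds))
```

For a two-vertex, one-edge query this produces exactly the self-join pattern over two Vertices instances and one Edges instance described above.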
5.3 Features-Based Query Optimization
Subsection 5.2 presented the general case of the SQL-based evaluation of subgraph queries. In this subsection we show how our query processor uses the components of the graph features knowledge (GFK) to reduce the search space and optimize query performance in several ways. Given a subgraph query q, the subgraph query evaluation process starts by applying the following two main steps in order to determine its query evaluation strategy:
1) Subgraph Query Bucket Matching. In this step we compute the subgraph query descriptor (qd) and then, using the aggregated graph descriptors component (AGD), identify the set of buckets in which the features of the graph query are fully contained. The subgraph query descriptor is considered fully contained in a bucket if and only if the value of each component of the query descriptor is less than or equal to its associated component in the bucket. This subgraph bucket matching process is defined as follows.
Definition 6 (Subgraph Bucket Match). Given an aggregate graph descriptor AGD = {b1, b2, ..., bn} and a subgraph query descriptor qd, the matched bucket set MB = {bi | qd ⊆ bi, bi ∈ AGD} where qd ⊆ bi iff qd.NN ≤ bi.NN ∧ qd.NE ≤ bi.NE ∧ qd.NL ≤ bi.NL ∧ qd.EL ≤ bi.EL ∧ qd.MID ≤ bi.MID ∧ qd.MOD ≤ bi.MOD ∧ qd.MD ≤ bi.MD ∧ qd.MPL ≤ bi.MPL.
Fig.6 illustrates an example of the matching process of a subgraph query descriptor against the buckets of
the aggregated graph descriptors component.
Fig.6. Bucket matching of subgraph query descriptor.
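Definition 6 amounts to a componentwise ≤ comparison between the query descriptor and each bucket. An illustrative sketch (our own helper, assuming descriptor tuples ordered (NN, NE, NL, EL, MID, MOD, MD, MPL)):

```python
def matched_buckets(qd, agg_descriptors):
    """Return the buckets that fully contain the query descriptor qd:
    every component of qd must be <= the corresponding bucket component
    (Definition 6). Buckets are dicts carrying a 'features' tuple."""
    return [b for b in agg_descriptors
            if all(q <= v for q, v in zip(qd, b["features"]))]
```

A query whose descriptor exceeds every bucket in some component matches nothing, which immediately yields an empty answer set without touching the database.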
2) Pruning Features Identification. We use the Markov summary tables to determine the most pruning (least frequently occurring) query features as follows:
• A vertex v in the subgraph query q is defined as a pruning vertex if its label frequency is less than a predefined threshold freq_v.
• An edge e in the subgraph query q is defined as a pruning edge if its label frequency is less than a predefined threshold freq_e.
• Two connected vertices are defined as a pruning pair if their label-connection frequency is less than a predefined threshold freq_p.
Based on the outcome of the subgraph query bucket matching and the pruning features identification steps, the query evaluation strategy of the filtering phase is determined and executed as depicted in the algorithms of Figs. 7 and 8. The details of these two algorithms are described as follows.
• Case SubStrategy 1. If the count of the matched buckets is equal to zero, then the features of the subgraph query are not contained in any graph database member. Therefore, we end the query evaluation process by returning an empty result set. In this case, the information of the aggregated graph descriptors component is quite useful as it avoids any access to the underlying graph database.
• Case SubStrategy 2. If any of the query pruning features appears in a number of graphs that is less than the defined threshold gct, then we unfold its associated inverted list to retrieve and verify its related graphs to compute the answer set of the subgraph query q. If more than one pruning feature satisfies this condition, then we select the one with the smallest search space.
• Case SubStrategy 3. If the count of the matched buckets is greater than zero and the size of each matched bucket (bi.graphCount) is less than the defined threshold t, then we unfold their associated inverted lists to retrieve and verify their related graph
database members in order to compute the answer set of the subgraph query q.
• If the conditions of the two cases SubStrategy 2 and SubStrategy 3 are both satisfied, then we compare their search spaces and apply the one with the smaller search space.
• Case SubStrategy 4. If we have a number of matched buckets with a size less than the threshold t and a number of matched buckets with a size greater than the threshold t, then for each matched bucket with a size less than t, we unfold its associated inverted list to retrieve and verify the related

Require: q: subgraph query; AD: aggregate descriptors; MVv: Markov table of vertex labels; MVe: Markov table of edge labels; MVp: Markov table of connected pairs of vertices; t: threshold of bucket size with an inverted list; gct: threshold of inverted list size for a Markov table entry
Ensure: SubStrategy: integer; pruningFeature: Markov table entry; mb: array of matched aggregated descriptor entries
 1: SubStrategy := 0
 2: qd := ComputeGraphQueryDescriptor(q)
 3: mb := IdentifySubMatchedBuckets(qd, AD)
 4: if mb.length = 0 then
 5:   SubStrategy := 1
 6:   return
 7: end if
 8: pf := IdentifyPruningFeatures(q, MVv, MVe, MVp)
 9: if pf.length > 0 then
10:   pfSearchSpace := gct + 1
11:   for i = 1 to pf.length do
12:     if pf[i].graphCount < pfSearchSpace then
13:       SubStrategy := 2
14:       pruningFeature := pf[i]
15:       pfSearchSpace := pf[i].graphCount
16:     end if
17:   end for
18: end if
19: fbSearchSpace := 0
20: for i = 1 to mb.length do
21:   if mb[i].graphCount > t then
22:     if SubStrategy = 0 then
23:       SubStrategy := 4
24:     end if
25:     exit for
26:   else
27:     fbSearchSpace += mb[i].graphCount
28:   end if
29:   if SubStrategy = 2 then
30:     if pfSearchSpace > fbSearchSpace then
31:       SubStrategy := 3
32:     end if
33:   else
34:     SubStrategy := 3
35:   end if
36: end for

Fig.7. IdentifySubStrategy: determines the subgraph query evaluation strategy.
Require: D: graph database; q: subgraph query; GD: graph descriptors; AD: aggregate descriptors; MVv: Markov table of vertex labels; MVe: Markov table of edge labels; MVp: Markov table of connected pairs of vertices; t: threshold of bucket size with an inverted list; gct: threshold of inverted list size for a Markov table entry
Ensure: ASGQ: answer set of the subgraph query q
 1: IdentifySubStrategy(q, AD, MVv, MVe, MVp, t, gct, SubStrategy, pruningFeature, mb)
 2: if SubStrategy = 1 then
 3:   return empty
 4: end if
 5: if SubStrategy = 2 then
 6:   for all g ∈ pruningFeature.InvertedList.graphs do
 7:     if Verify(g, q) then
 8:       ASGQ.ADD(g)
 9:     end if
10:   end for
11:   return ASGQ
12: end if
13: if SubStrategy = 3 then
14:   for i = 1 to mb.length do
15:     for all g ∈ mb[i].InvertedList.graphs do
16:       if Verify(g, q) then
17:         ASGQ.ADD(g)
18:       end if
19:     end for
20:   end for
21:   return ASGQ
22: end if
23: if SubStrategy = 4 then
24:   for i = 1 to mb.length do
25:     if mb[i].graphCount ≤ t then
26:       for all g ∈ mb[i].InvertedList.graphs do
27:         if Verify(g, q) then
28:           ASGQ.ADD(g)
29:         end if
30:       end for
31:     else
32:       cg := SQLFeaturedFilter(D, q, GD, mb[i])
33:       for all g ∈ cg do
34:         if Verify(g, q) then
35:           ASGQ.ADD(g)
36:         end if
37:       end for
38:     end if
39:   end for
40:   return ASGQ
41: end if

Fig.8. Features-based evaluation of subgraph queries.
graphs where the target graphs are added to the query answer set. For each matched bucket (bs ) with a percentage greater than t, we apply a features-based filtering phase where the bucket features and the graph descriptor component are used to limit the search space. The SQL template of this features-based filtering phase is depicted in Fig.9. In this template, m and n are the numbers of vertices and edges of the subgraph query q respectively. Each referenced table Vi and Ej maps the information of one of the query vertices and query edges respectively. Lines
7∼13 use the graph descriptors information to retrieve only the graph database members which satisfy the features of the matched buckets. It should be noted that each candidate graph database member can belong to only one of the intermediate results, based on the matching of its graph descriptor with the unique features of each matched bucket. Thus, the intermediate results are guaranteed to be disjoint and duplicate-free. However, an optional verification phase is required if and only if more than one vertex of the subgraph query has the same label. In this case, we need to verify that the vertices in the set of filtered vertices for each candidate graph are distinct, which can easily be achieved using their vertexIDs. Although the conditions of this verification process could be injected into the SQL translation template of the features-based filtering phase, we found it more efficient to avoid the cost of evaluating these conditions over each graph database member by delaying their processing (if required) to a separate phase after pruning the candidate list.

1  SELECT DISTINCT V1.graphID
2  FROM Vertices as V1, . . ., Vertices as Vm,
3       Edges as E1, . . ., Edges as En,
4       GraphDescriptors as GD
5  WHERE
6    V1.graphID = GD.graphID
7    AND GD.NN = bs.NN
8    AND GD.NE = bs.NE
9    AND GD.NL = bs.NL
10   AND GD.EL = bs.EL
11   AND GD.MID = bs.MID
12   AND GD.MOD = bs.MOD
13   AND GD.MD = bs.MD
14   AND ∀ i ∈ [2, m]: (V1.graphID = Vi.graphID)
15   AND ∀ j ∈ [1, n]: (V1.graphID = Ej.graphID)
16   AND ∀ i ∈ [1, m]: (Vi.vertexLabel = QVi.vertexLabel)
17   AND ∀ j ∈ [1, n]: (Ej.edgeLabel = QEj.edgeLabel)
18   AND ∀ j ∈ [1, n]: (Ej.sVertex = Vf.vertexID
19        AND Ej.dVertex = Vf.vertexID);

Fig.9. Features-based SQL template for the filtering phase.
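To make the template concrete, the following is a minimal, self-contained sketch in Python/sqlite3 that instantiates the Fig.9 template for a hypothetical two-vertex, one-edge query (A −x→ B). The table layout, column types and the toy data are assumptions for illustration; only the shape of the WHERE clause follows the template.

```python
import sqlite3

# Hypothetical minimal instance of the Vertex-Edge encoding plus the
# graph descriptor table; the exact schema is an assumption.
con = sqlite3.connect(":memory:")
cur = con.cursor()
cur.executescript("""
CREATE TABLE Vertices (graphID INT, vertexID INT, vertexLabel TEXT);
CREATE TABLE Edges (graphID INT, sVertex INT, dVertex INT, edgeLabel TEXT);
CREATE TABLE GraphDescriptors (graphID INT, NN INT, NE INT, NL INT,
                               EL INT, MID INT, MOD INT, MD INT);
""")
# Graph 1: A -x-> B -y-> C (matches the bucket features used below);
# Graph 2: A -x-> C (different descriptor, filtered out).
cur.executemany("INSERT INTO Vertices VALUES (?,?,?)",
                [(1, 1, 'A'), (1, 2, 'B'), (1, 3, 'C'),
                 (2, 1, 'A'), (2, 2, 'C')])
cur.executemany("INSERT INTO Edges VALUES (?,?,?,?)",
                [(1, 1, 2, 'x'), (1, 2, 3, 'y'), (2, 1, 2, 'x')])
cur.executemany("INSERT INTO GraphDescriptors VALUES (?,?,?,?,?,?,?,?)",
                [(1, 3, 2, 3, 2, 1, 1, 2), (2, 2, 1, 2, 1, 1, 1, 1)])

# Fig.9 instantiated for m = 2, n = 1 against a bucket bs with the
# features NN=3, NE=2, NL=3, EL=2, MID=1, MOD=1, MD=2.
rows = cur.execute("""
SELECT DISTINCT V1.graphID
FROM Vertices V1, Vertices V2, Edges E1, GraphDescriptors GD
WHERE V1.graphID = GD.graphID
  AND GD.NN = 3 AND GD.NE = 2 AND GD.NL = 3 AND GD.EL = 2
  AND GD.MID = 1 AND GD.MOD = 1 AND GD.MD = 2
  AND V2.graphID = V1.graphID AND E1.graphID = V1.graphID
  AND V1.vertexLabel = 'A' AND V2.vertexLabel = 'B'
  AND E1.edgeLabel = 'x'
  AND E1.sVertex = V1.vertexID AND E1.dVertex = V2.vertexID
""").fetchall()
print([r[0] for r in rows])   # only graph 1 survives the descriptor filter
```

Note how the descriptor conditions (lines 7∼13 of the template) prune graph 2 before any structural join is attempted.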
5.4 Selectivity Annotation of Query Predicates
The main goal of RDBMS query optimizers is to find the most efficient execution plan for every given SQL query. For any given SQL query, there are a large number of alternative execution plans. These alternative execution plans may differ significantly in their use of system resources or response time. In general, one of the most effective techniques for optimizing the execution time of SQL queries is to select the relational execution plan based on the accuracy of the selectivity
information of the query predicates. For example, the query optimizer may need to estimate the selectivities of the occurrences of two vertices in one subgraph, one with label A and the other with label B, in order to choose the more selective vertex to be filtered first. Sometimes query optimizers are not able to select the optimal execution plan for the input queries because the required statistical information is unavailable or inaccurate. To tackle this problem, we use our statistical summary information to give hints to the query optimizer by injecting additional selectivity information for the individual query predicates into the SQL queries. These hints enable the query optimizer to decide the optimal join order, utilize the most useful indexes and select the cheapest execution plan. More specifically, in SubStrategy 4, we use the selectivity information of the query pruning features (acquired from the Markov summary tables) to hint the selectivity of the associated query predicates to the query optimizer of DB2 (our hosting experimental engine) using the following syntax:

SELECT fieldlist FROM tablelist WHERE Pi SELECTIVITY Si
where Si indicates the selectivity value for the query predicate Pi. These selectivity values range between 0 and 1. Lower selectivity values (close to 0) inform the query optimizer that the associated predicates will effectively prune the size of the intermediate results and thus should be executed first.

5.5 Supporting Relational Indexes
Relational database indexes have proven to be very efficient tools for speeding up the evaluation of SQL queries. Moreover, the performance of query evaluation in relational database systems is very sensitive to the index structures defined over the data of the source tables. In principle, relational indexes can accelerate query evaluation in several ways[38]. For example, applying highly selective predicates first can limit the data that must be accessed to a very limited set of rows that satisfy those predicates. Additionally, query evaluation can be achieved using index-only access, avoiding the need to access the original data pages by providing all the data needed for the query evaluation. Partitioned B-tree indexes are a slight variant of the B-tree indexing structure. The main idea of this indexing technique was presented by Graefe in [27], where he recommended the use of low-selectivity leading columns to maintain the partitions within the associated B-tree. In labeled
graphs, it is generally the case that the number of distinct vertex and edge labels is far smaller than the number of vertices and edges, respectively. Hence, for example, having an index defined on the columns (vertexLabel, graphID) can reduce the access cost of a subgraph query with only one label to one disk page which stores the list of graphIDs of all graphs that include a vertex with the target query label. In contrast, an index defined on the two columns (graphID, vertexLabel) requires scanning a large number of disk pages to get the same list of target graphs. Conceptually, this approach can be considered as a horizontal partitioning of the Vertices and Edges tables using the high-selectivity partitioning attributes[39]. Therefore, instead of requiring an execution time that is linear in the number of graph database members (the graph database size), partitioned B-tree indexes on the high-selectivity attributes can achieve a fixed execution time that no longer depends on the size of the whole graph database[27-28]. Moreover, by leveraging a pure relational infrastructure, we are able to use the ready-made tools provided by RDBMSs to propose a candidate set of indexes that are effective for accelerating our SQL-based query workloads[40-41]. For example, we can use the db2advis tool provided by the DB2 engine to recommend efficient indexes for our query workload. Through the use of this tool we can significantly improve the quality of the designed indexes and speed up the query performance by reducing the number of calls to the database engine. Similar tools are also available in most commonly used commercial RDBMSs.

6 Relational Processing of Supergraph Search Queries
In this section, we present our query mechanism for processing supergraph queries. Similar to our previously introduced mechanism for processing subgraph queries (Section 5), the information of the graph features knowledge (GFK) plays an effective role in reducing the search space and optimizing the query performance. Given a supergraph query q, we start by applying the following two main steps: 1) Supergraph Query Bucket Matching. A bucket of the aggregate graph descriptors is considered to be matched with the features of the supergraph query descriptor if and only if the value of each component in the query descriptor is greater than or equal to its associated component in the bucket. This supergraph bucket matching process is defined as follows. Definition 7 (Supergraph Bucket Match). Given an aggregate graph descriptor AGD = {b1 , b2 , . . . , bn }
and a supergraph query descriptor qd, the matched bucket set MB = {bi | qd ⊇ bi , bi ∈ AGD} where qd ⊇ bi iff qd.NN ≥ bi.NN ∧ qd.NE ≥ bi.NE ∧ qd.NL ≥ bi.NL ∧ qd.EL ≥ bi.EL ∧ qd.MID ≥ bi.MID ∧ qd.MOD ≥ bi.MOD ∧ qd.MD ≥ bi.MD ∧ qd.MPL ≥ bi.MPL. Fig.12 illustrates an example of the matching process of a supergraph query descriptor against the buckets of an aggregate graph descriptor. 2) Pruning Features Identification. This step determines the most pruning (least frequent) query features in exactly the same way as for subgraph queries (Subsection 5.3). Based on the outcome of these two steps, the supergraph query evaluation strategy is determined as depicted in the algorithms illustrated in Figs. 10 and 11, which are further described as follows. • Case SupStrategy 1. If the count of the matched buckets is equal to zero, we return an empty result set because none of the graph database members has features which are fully contained in the supergraph query descriptor. • Case SupStrategy 2. This case is exactly similar to the subgraph query case SubStrategy 3. It requires that the size of each matched bucket (bi.graphCount) is less than the defined threshold t. If so, we unfold the associated inverted lists of the matched buckets to retrieve and verify their related graphs in order to compute the answer set of the supergraph query q. • Case SupStrategy 3. This case handles the situation when we have a number of matched buckets with a size less than the threshold t and a number of matched buckets with a size greater than the threshold t. Similar to SupStrategy 2, for each matched bucket with a size less than t, we unfold its associated inverted list to retrieve and verify the related graphs.
However, for each matched bucket (mbs) with a size greater than t, we convert the supergraph search problem into a subgraph search problem. We apply a backtracking step which uses the features of the matched bucket (mbs) to extract from the supergraph query (q) all subgraphs (sqg) with exactly the same features. We then treat each extracted subgraph (eqg) as a subgraph query, where we issue a features-based SQL query using the template presented in SubStrategy 4 to retrieve all graph database members that fulfill the features of both the extracted subgraph (eqg) and the matched bucket (mbs). Clearly, this backtracking step dramatically reduces the search space and avoids the need for a further containment verification phase, as the extracted subgraphs are guaranteed to be contained in the supergraph query (q). Similar to SubStrategy 4, an optional verification phase is required if and only if more than one vertex of the extracted subgraph query has the same label. In this case, the verification phase ensures that the set of filtered vertices for each candidate graph is distinct.

Require: q: supergraph query, AD: aggregate descriptors, t: threshold of bucket size with an inverted list
Ensure: SupStrategy: integer; mb: array of matched aggregate descriptor entries
1  qd := ComputeGraphQueryDescriptor(q)
2  mb := IdentifySupMatchedBuckets(qd, AD)
3  if mb.length = 0 then
4    SupStrategy := 1
5    return
6  end if
7  SupStrategy := 2
8  for i = 1 to mb.length do
9    if mb[i].graphCount > t then
10     SupStrategy := 3
11     return
12   end if
13 end for
Fig.10. IdentifySupStrategy: Determines the supergraph query evaluation strategy.

Require: D: graph database, q: supergraph query, GD: graph descriptors, AD: aggregate descriptors, t: threshold of bucket size with an inverted list
Ensure: ASGQ: answer set of the supergraph query (q)
1  IdentifySupStrategy(q, AD, t, SupStrategy, mb)
2  if SupStrategy = 1 then
3    return empty
4  end if
5  if SupStrategy = 2 then
6    for i = 1 to mb.length do
7      for all g ∈ mb[i].InvertedList.graphs do
8        if Verify(g, q) then
9          ASGQ.ADD(g)
10       end if
11     end for
12   end for
13   return ASGQ
14 end if
15 if SupStrategy = 3 then
16   for i = 1 to mb.length do
17     if mb[i].graphCount ≤ t then
18       for all g ∈ mb[i].InvertedList.graphs do
19         if Verify(g, q) then
20           ASGQ.ADD(g)
21         end if
22       end for
23     else
24       sqg := ExtractFeaturedSubgraphs(q, mb[i])
25       for all eqg ∈ sqg do
26         cg := SQLFeaturedFilter(D, eqg, GD, mb[i])
27         for all g ∈ cg do
28           if Verify(g, q) then
29             ASGQ.ADD(g)
30           end if
31         end for
32       end for
33     end if
34   end for
35   return ASGQ
36 end if
Fig.11. Features-based evaluation of supergraph queries.

Fig.12. Bucket matching of supergraph query descriptor.

7 Decomposition Mechanism of Large Subgraph Queries
In Subsection 5.3, we presented our SQL template for the features-based evaluation strategy of subgraph queries. One concern with this one-step translation template is that the generated SQL scripts for large (in terms of the number of nodes and edges) subgraph queries can involve a large number of join operations. As we previously discussed in Subsection 3.3, given a subgraph query q that consists of m vertices and n edges, the number of join operations between the encoding relations based on the Vertex-Edge encoding scheme is equal to m + n, while the Edge-Edge encoding scheme involves n join operations. Most query engines either have a strictly defined limit on the number of join operations they can evaluate in one query, or their performance degrades dramatically once the number of join operations exceeds a certain limit. In order to tackle this problem, we employ a decomposition mechanism that divides the one-step SQL translation of large subgraph queries into a sequence of intermediate queries (using temporary tables). Applying this decomposition mechanism blindly can lead to inefficient execution plans with very large intermediate results. Therefore, we use the Nodes-Edges Markov Summaries to perform an effective selectivity-aware decomposition process. Specifically, given a subgraph query q with a required number of join operations equal to NJ, our decomposition mechanism goes through the following steps: • Identifying the Number of Pruning Features. As we previously mentioned in Subsection 5.3, we consider each query vertex, query edge or pair of connected query vertices with a low frequency as a pruning feature. We check the structure of q against our Markov tables to identify the possible pruning features. We denote the number of identified pruning features by NPP.
• Computing the Number of Partitions. Assuming that the relational query engine has an explicit or implicit limit MNJ on the number of join operations that can be involved in one query, the number of required partitions NOP can simply be computed as NOP = ⌈NJ/MNJ⌉. • Decomposing the Subgraph Query. Based on the identified number of pruning features (NPP) and the number of partitions (NOP), we choose between the following three alternatives: 1) Blind decomposition: if NPP = 0, we blindly decompose the subgraph query q into the computed number of partitions NOP, where each partition is translated using our translation template into an intermediate evaluation step Si. The final evaluation step FES represents a join operation between the results of all intermediate evaluation steps Si, together with the conjunctive conditions of the subgraph connectors. Without any information about effective pruning features, some intermediate results may contain a large set of non-required graph members. 2) Pruned single-level decomposition: if NPP ≥ NOP, we distribute the pruning features across the NOP partitions. Thus, we ensure a balanced, effective pruning of all intermediate results. The intermediate results Si of all pruned partitions are constructed before the final evaluation step FES joins all these intermediate results, together with the connecting conditions, to constitute the final result. 3) Pruned multi-level decomposition: if NPP < NOP, we distribute the pruning features across the first-level intermediate results of the NOP partitions. This step ensures an effective pruning of a fraction NPP/NOP of the partitions. An intermediate collective pruned step IPS is constructed by joining all these pruned first-level intermediate results together with the connecting conditions between them.
Progressively, IPS is then used as an entry pruning feature for the remaining (NOP − NPP) non-pruned partitions in a hierarchical multi-level fashion to constitute the final result set. However, the number of non-pruned partitions can be reduced if any of them can be connected to one of the pruning features. In other words, each pruning feature can be used to prune, if possible, more than one partition to avoid the cost of having any large intermediate results. Fig.13 illustrates two examples of our decomposition mechanism, where pruning vertices are marked by solid fillings, pruning edges are marked by bold line styles and the connectors between subgraphs are marked by dashed edges. Fig.13(a) represents an example where the number of pruning features is greater than the number of partitions. Fig.13(b) represents another example where the number of pruning features is less than the number of partitions and one pruning vertex is used to prune different partitioned subgraphs.
Fig.13. Decomposition of large subgraph queries. (a) NPP > NOP . (b) NPP < NOP .
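The choice among the three decomposition alternatives can be sketched as follows. The helper name and the join-limit value MNJ are hypothetical, chosen only to illustrate the NOP computation and the NPP-based case analysis.

```python
import math

MNJ = 8   # assumed engine limit on joins per query (hypothetical value)

def choose_decomposition(nj, npp, mnj=MNJ):
    """Sketch of the alternative selection of Section 7.

    nj:  number of join operations required by the one-step translation
    npp: number of pruning features found in the Markov tables
    Returns (NOP, name of the chosen alternative).
    """
    nop = math.ceil(nj / mnj)          # number of required partitions
    if nop <= 1:
        return nop, "no decomposition"  # one-step translation fits the limit
    if npp == 0:
        return nop, "blind"
    if npp >= nop:
        return nop, "pruned single-level"
    return nop, "pruned multi-level"

print(choose_decomposition(30, 5))   # NJ=30 joins, 5 pruning features
print(choose_decomposition(30, 2))   # fewer pruning features than partitions
```

With MNJ = 8, a 30-join query needs ⌈30/8⌉ = 4 partitions; 5 pruning features select the single-level variant, 2 select the multi-level one.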
8 Performance Evaluation
In this section, we present an extensive experimental study of our proposed graph querying mechanisms. We conducted our experiments using the IBM DB2 RDBMS running on a PC with a 3.2 GHz Intel Xeon processor, 4 GB of main memory and 250 GB of SCSI secondary storage.

8.1 Datasets
In our experiments we used two kinds of datasets: 1) A real AIDS antiviral screen dataset (NCI/NIH) which is publicly available on the website of the Developmental Therapeutics Program[42]. For this dataset, different query sets are used, each of which has 1000 queries. These 1000 queries are constructed by randomly selecting 1000 graphs from the antiviral screen dataset and then extracting a connected m-edge subgraph from each graph randomly. Each query set is denoted by its edge size (m) as Qm. 2) A set of synthetic datasets generated by our implemented data generator that follows the same idea presented by Kuramochi and Karypis in [43]. The generator allows the user to specify the number of graphs (D), the average number of vertices per graph (V), the average number of edges per graph (E), the number of distinct vertex labels (L) and the number of distinct edge labels (M). We use the notation DdEeVvLlMm to represent the generation parameters of each dataset.
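A minimal sketch of a generator honoring the DdEeVvLlMm parameters might look as follows. This is an illustrative assumption, not the actual generator of [43]: the function name and the size-variation scheme are invented for the example.

```python
import random

def generate_dataset(D, V, E, L, M, seed=0):
    """Hypothetical toy generator: D directed labeled graphs with roughly
    V vertices and E edges each, drawn from L vertex and M edge labels."""
    rng = random.Random(seed)
    graphs = []
    for _ in range(D):
        nv = max(2, V + rng.randint(-2, 2))     # vary sizes around the average
        vertices = [rng.randrange(L) for _ in range(nv)]
        ne = max(1, E + rng.randint(-2, 2))
        edges = [(rng.randrange(nv), rng.randrange(nv), rng.randrange(M))
                 for _ in range(ne)]            # (source, destination, label)
        graphs.append((vertices, edges))
    return graphs

# e.g., the dataset D1000V30E40L90M150 of the paper's notation
db = generate_dataset(D=1000, V=30, E=40, L=90, M=150)
print(len(db))
```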
8.2 Experiments
8.2.1 Offline Database Preprocessing Cost

We first examined the offline preprocessing cost of our approach. Several synthetic datasets have been used in these experiments. Each dataset is denoted as dn, where n represents the number of graphs in the dataset. Fig.14 shows the offline preprocessing cost for these datasets. Fig.14(a) shows the disk storage size of the graph features knowledge in comparison to the GIndex[12] and TreePi[13] approaches. The results of this experiment show that the storage size of the graph features knowledge is the highest and that it increases linearly w.r.t. the size of the graph databases. Obviously, the main reason behind this is that each graph database member requires a new record in the graph descriptor table, while both GIndex and TreePi index only the frequent features (graphs or trees). In fact, the size of the aggregate graph descriptors component is very small and can be neglected. The size of this aggregate component is not affected by the increase in the number of graph database members; it is only affected by the number of unique features of the graph database members. Fig.14(b) represents the running time of constructing the graph features knowledge in comparison to the indexing times of GIndex and TreePi.

Fig.14. Offline database preprocessing cost. (a) Index size. (b) Construction time.

The results of this experiment show that the running time of constructing the graph features knowledge is the lowest. The main reason behind this is the straightforward computation of the features of each graph database member, while GIndex and TreePi apply sophisticated mining techniques to choose their frequent indexing features. Moreover, the index construction time of GIndex and TreePi is approximately proportional to the database size, with TreePi being relatively faster due to the fact that frequent subtree mining is simpler than frequent subgraph mining. The construction time of the graph features knowledge is more stable and scalable, and increases at a much lower rate than the number of graph database members. The main reason behind this is the efficiency of the relational indexing infrastructure in computing the aggregate values that represent the components of the descriptor record of each graph database member.

8.2.2 Subgraph Query Performance

One of the main advantages of using a relational database to store and query graph databases is to exploit its well-known scalability. Fig.15(a) shows the query performance of our features-based SQL evaluation of subgraph queries. The results of this experiment illustrate the average execution time of the features-based SQL translations of 1000 subgraph queries with different sizes on graph databases with scaling characteristics. In this figure, the running time of the subgraph query processing is presented on the Y-axis, the X-axis represents the different graph databases and the different bars represent the different query sizes. The running times of these experiments include both the filtering and verification phases. However, on average the running time of the verification phase represents 5% of the total running time and can be considered to have a negligible effect for all queries with small result sets. The results of this experiment show that our approach performs and scales in a near-linear fashion with respect to the graph database and query sizes. The growth of the execution time starts to flatten for the larger datasets because of the efficiency of our query optimization techniques. The speedup improvements of these different optimization techniques are discussed in the next set of experiments. Fig.15(b) shows the results of comparing the performance of our approach with the TreePi and GIndex approaches (using the two largest synthetic datasets). The results of this experiment show that our approach is able to outperform the TreePi and GIndex approaches due to the well-known scalability advantage of the relational infrastructure in handling large datasets which may not fit in main memory. In principle, the results of this experiment are expected because the TreePi and GIndex approaches are not designed to handle these very large datasets. Moreover, they are designed to deal with the general subgraph-isomorphism problem.
Fig.15. Features-based evaluation of subgraph queries. (a) Relational scalability. (b) Comparison with TreePi and GIndex.
8.2.3 Speedup Improvements

Fig.16 represents the speedup improvements of the different query optimization techniques applied in our approach. We evaluated the effect of each optimization technique independently in a separate experiment. The reported percentages of speedup improvement in these experiments are computed using the formula (1 − G/C)%, where G represents the execution time of the SQL execution plans using one of the optimization techniques while C represents the execution time of the SQL execution plans without using this optimization technique. In principle, the results of these experiments confirm the efficiency of our optimization techniques. Fig.16(a) indicates that the speedup improvement introduced by using the graph descriptor component is the highest.
The main reason behind this is that injecting the filtering conditions for the components of the matched buckets together with the descriptor of the subgraph query dramatically reduces the search space and gets rid of many irrelevant graph database members which can never appear in the answer set of the subgraph query. Fig.16(b) indicates that the speedup improvement of using the partitioned B-tree indexes is quite effective. The main reason behind this is that it effectively reduces the access cost of the secondary storage. As we previously mentioned, in labeled graphs it is generally the case that the number of distinct vertex and edge labels is far smaller than the number of vertices and edges, respectively. In principle, the effect of partitioned B-tree indexes increases with the size of the graph database and the percentage of distinct node/edge labels. Therefore, the effect on the synthetic database is greater than the effect on the AIDS dataset. Fig.16(c) indicates the speedup improvement of injecting the selectivity annotations of the filtering predicates into our SQL translation scripts. The results of this experiment show that the bigger the query size, the more join operations need to be executed and consequently the higher the effect of the selectivity annotations in helping the query optimizer to decide the cheapest execution plan. Fig.16(d) indicates the further speedup improvement which can be achieved by encoding the graph database using the Edge-Edge instead of the Vertex-Edge scheme which we used in all of our experiments. In this experiment, the reported percentage of speedup improvement is also computed using the formula (1 − G/C)%. However, here G represents the execution time of the SQL execution plans using the Edge-Edge encoding scheme while C represents the execution time of the SQL execution plans using the Vertex-Edge encoding scheme. In principle, the results of this experiment are quite expected and can be explained fairly easily by the effective reduction in the required number of expensive join operations between the encoding relations to evaluate the subgraph queries. As we previously discussed in Subsection 3.3, the Edge-Edge scheme can be considered as a denormalized version of the Vertex-Edge scheme, which explains why it is more efficient and more suitable for querying purposes over static graph databases. In principle, the measurements of the speedup improvements of the different optimization techniques show that the bigger the query size or graph database size, the higher the cost of the query evaluation and consequently the bigger the effect of the different optimization techniques on improving the performance of the query processing.
Fig.16. Speedup improvements. (a) Graph descriptors. (b) Partitioned B-trees. (c) Selectivity injections. (d) Edge-Edge encoding scheme.
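The speedup formula can be checked with a small worked example; the timing values below are invented for illustration.

```python
def speedup_improvement(g, c):
    """Speedup percentage (1 - G/C)% used in Fig.16, where g is the
    execution time with the optimization and c the time without it."""
    return (1 - g / c) * 100

# e.g., an optimized plan taking 6.5 s versus 10 s without the optimization
print(f"{speedup_improvement(6.5, 10.0):.0f}%")   # prints 35%
```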
8.2.4 Analysis of the Effect of the Threshold Parameter (t)

The value of the threshold parameter t decides whether to create an inverted list for a bucket of the aggregate descriptors component: an inverted list is created if the number of the bucket's related graphs is less than or equal to the value of t. In this experiment, we tested the effect of the value of t on the construction time of these inverted lists and on the performance of the query processing. The experiment analyzing the effect of t on the query processing time was executed using a synthetic dataset generated with the parameters D50kV30E40L90M150. The results show that the smaller the value of t, the shorter the construction time of the inverted lists (Fig.17(a)) but the longer the query processing time (Fig.17(b)). The main reason behind this is that with a smaller value of t, fewer inverted lists are constructed; therefore most of the matched buckets of any query will not find an associated inverted list, which increases the number of issued SQL queries to retrieve the candidate graphs which satisfy the features of the matched buckets. However, it is not entirely true that a larger value of t always yields a shorter query response time. The advantage of using a larger value of t is to avoid accessing the secondary storage of the encoding relations of the graph database. However, the larger the value of t, the bigger the search space and the cost of probing the inverted lists to verify the related graphs and identify the target ones. This probing cost can increase dramatically if the memory required for retrieving the inverted lists is bigger than the available budget. Therefore, there is a point at which the query processing time becomes longer, when the increase in the inverted-list probing cost is greater than the cost of the SQL-based evaluation. Thus, a moderate value of t which takes into consideration the available budget of main memory is actually the better choice.

Fig.17. Effect of the value of the threshold parameter (t). (a) Inverted list construction time. (b) Query performance.

8.2.5 Supergraph Query Performance

Fig.18 shows the query performance of our features-based SQL evaluation of supergraph queries. The results of this experiment illustrate the average execution times of the features-based SQL translations of 1000 supergraph queries with different sizes on graph databases with scaling characteristics. The running times of these experiments include both the filtering and verification phases. Similar to the subgraph query experiments, the average running time of the verification phase can be neglected as it represents less than 3% of the total running time. The SQL translation scripts of this experiment use the different optimization techniques introduced in our approach (graph descriptors, partitioned B-trees and selectivity injections).

Fig.18. Features-based evaluation of supergraph queries.

In a previous experiment, we presented the offline preprocessing cost which is required in our approach in order to construct the graph features knowledge (Fig.14). Although the size of the graph features knowledge (Fig.14(a)) can be considered as an additional overhead, it is well worth paying as it significantly improves the performance of query evaluation, especially in the particular case of supergraph queries. Without this knowledge, evaluating supergraph queries would be quite challenging and expensive for any SQL-based approach, as it would require issuing an intermediate evaluation step for each pair of integer values x and y, where x is the number of nodes, y is the number of edges and x > y. The main reason behind this is that the filtering predicates would not be able to consider any information about the features or the topology of the supergraph query or of the graphs in the underlying database. Moreover, they would have no means to check whether the filtered graphs are fully contained in the supergraph query q or not. This would also yield too many false positives and duplicate graphs in the intermediate results. In our approach, the supergraph bucket matching process (Definition 7) plays a crucial role in solving this problem by identifying the set of target features of the intermediate results and converting the supergraph search problem into an efficient subgraph search problem.

9 Related Work

Graph databases and graph query processing have attracted a lot of attention from the database community. Several approaches have been proposed to deal with different types of graph queries[44]. This section gives an overview of the literature classified by the type of their target graph queries.

9.1 Subgraph Queries

Many graph indexing techniques have recently been proposed to deal with the problem of subgraph query processing[8,12,17,45-46]. They can be classified into two main approaches: 1) mining-based graph indexing techniques; 2) non-mining-based graph indexing techniques.

9.1.1 Mining-Based Graph Indexing Techniques
Sherif Sakr et al.: Efficient Relational Techniques for Processing Graph Queries

The GIndex[12] index structure uses frequent subgraphs as the basic indexing unit. Given a query graph q, if q is a frequent subgraph, the exact set of answers containing q can be retrieved directly. If the query graph q is infrequent, then this subgraph is contained in only a small number of graphs in the database. Cheng et al.[8] have extended the ideas of GIndex[12] by using a nested inverted-index in a new graph index structure named FG-Index. In this index structure, a memory-resident inverted-index is built using the set of frequent subgraphs, and a disk-resident inverted-index is built on the closure of the frequent graphs. If the closure is too large, a local set of closure frequent subgraphs is computed from the set of frequent graphs and a further nested inverted-index is constructed. The TreePi[13] index structure uses frequent subtrees as the indexing unit. The main observation of this approach is that frequent subtree mining is relatively easier than general frequent subgraph mining. Zhao et al.[46] have extended the idea of TreePi to achieve better pruning ability by adding a small number of discriminative graphs to the index structure.

9.1.2 Non-Mining-Based Graph Indexing Techniques

GraphGrep[9] uses enumerated paths as its indexing feature. For each graph database member, it enumerates all paths up to a certain maximum length and records the number of occurrences of each path. During query processing, the path index is used to find the set of candidate graphs which contain the paths of the query structure and to check whether the counts of such paths are beyond the thresholds specified by the query. The main limitation of this approach is that the size of the indexed paths can increase drastically with the size of the graph database. The GCoding[14] index structure uses a tree data structure to represent the local structure associated with each graph node. A spectral graph code is generated by combining all node signatures in each graph, and a vertex signature dictionary is built to store all distinct vertex signatures. The filtering phase uses both the graph codes and the local node signatures to avoid introducing many false candidates. The GDIndex[11] technique relies on a structured graph decomposition approach where all connected and induced subgraphs of a given graph are enumerated. A graph of size n is decomposed into at most 2^n subgraphs when each of the vertices has a unique label.
A hash table is used to index the subgraphs enumerated during the decomposition process. A clear advantage of the GDIndex approach is that no candidate verification is required. However, the index is designed for databases that consist of relatively small graphs and do not have a large number of distinct graphs. ClosureTree[47] is a tree-based index structure which is very similar to the R-tree indexing mechanism[48] but extended to support graph matching queries. In this index structure, each node in the tree contains discriminative information about its descendants in order to facilitate effective pruning. Each node represents the graph closure of its children, where the children of an internal node are nodes and the children of a leaf node are database graphs. GString[10] focuses on modeling graph objects in the context of organic chemistry using basic structures (Line, Star and Cycle) that have semantic meaning, and uses them as indexing features. GString represents both graphs and queries as string sequences and transforms the problem into the subsequence string matching domain. A suffix tree-based index structure is then created to support an efficient string matching process. A key disadvantage of the GString approach is that converting subgraph search queries into a string matching problem can be inefficient, especially if the size of the graph database or the subgraph query is large. Additionally, it is not trivial to extend this approach to other application domains.
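Common to the subgraph indexing techniques surveyed above is a filter-and-verify pipeline: an index over pre-extracted features (frequent subgraphs, subtrees, or paths) prunes the database down to a candidate set, which is then checked with expensive subgraph isomorphism tests. The following is a minimal sketch of this pattern under our own simplifying assumptions: features are modeled as hashable tokens (e.g. edge-label strings) and the verification step is abstracted behind a caller-supplied predicate; none of the identifiers below come from the systems discussed.

```python
from collections import defaultdict

def build_inverted_index(db_features):
    """Map each feature to the set of graph ids that contain it."""
    index = defaultdict(set)
    for gid, feats in db_features.items():
        for f in feats:
            index[f].add(gid)
    return index

def filter_and_verify(index, all_ids, query_features, verify):
    """Filter: only graphs containing every query feature survive.
    Verify: run the (expensive) isomorphism test on the survivors."""
    candidates = set(all_ids)
    for f in query_features:
        candidates &= index.get(f, set())   # intersect posting lists
    return sorted(gid for gid in candidates if verify(gid))

# Toy database of three graphs summarized by edge-label features.
db = {1: {"C-C", "C-O"}, 2: {"C-C"}, 3: {"C-O", "O-H"}}
idx = build_inverted_index(db)
# With a trivially permissive verifier, only the filtering takes effect:
print(filter_and_verify(idx, db, {"C-C", "C-O"}, lambda gid: True))  # [1]
```

The filtering step never produces false negatives (any true answer contains every query feature), so correctness rests entirely on the verifier; the quality of the chosen features only determines how many false positives reach it.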
9.2 Supergraph Queries
Despite the importance of the supergraph querying problem in many practical applications, it has not been extensively considered in the literature, and very few approaches have been presented to deal with it. The cIndex[15] technique uses as its indexing unit subgraphs that are extracted from the graph database based on their rare occurrence in historical query graphs. It tries to find a set of contrast subgraphs that collectively perform well. An advantage of the cIndex approach is that the size of the feature index is small. However, the query logs may change frequently, so the feature index may become outdated quite often and need to be recomputed to stay effective. Zhang et al.[16] have proposed an approach for compact organization of graph database members named GPTree. In this approach, all of the graph database members are stored in one graph where the common subgraphs are stored only once. Based on the containment relationships between the support sets of the extracted features, a mathematical approach is used to determine an ordering of the feature set which can reduce the number of subgraph isomorphism tests during query processing.
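The pruning logic that feature-based supergraph indexes exploit can be stated as an exclusion rule: if a feature occurs in a database graph but not in the query, that graph cannot be contained in the query. A hedged sketch of this rule, with feature containment abstracted to set membership (real systems such as cIndex use mined contrast subgraphs and subgraph isomorphism tests; the names below are our own):

```python
def supergraph_candidates(db_features, query_features):
    """Candidate answers of a supergraph query q: database graphs whose
    every feature also occurs in q. A graph g with a feature missing
    from q is pruned, since f contained in g but not in q implies that
    g cannot be contained in q."""
    q = set(query_features)
    return sorted(gid for gid, feats in db_features.items()
                  if set(feats) <= q)

# Toy database of three graphs summarized by edge-label features.
db = {1: {"C-C", "C-O"}, 2: {"C-C", "C-N"}, 3: {"C-O"}}
print(supergraph_candidates(db, {"C-C", "C-O", "O-H"}))  # [1, 3]
```

Graph 2 is pruned without any isomorphism test because its "C-N" feature does not occur in the query; the surviving candidates still need full containment verification.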
10 Conclusions
Efficient processing of graph queries plays a critical role in many application domains which involve complex structures, such as social networks, bioinformatics and chemical compounds. In this paper, we presented a purely relational framework for processing graph queries. Our approach encodes a graph database using a fixed relational schema and translates graph queries into SQL scripts. A main advantage of this approach is that it can reside on any RDBMS and exploit its well-known, mature query optimization and indexing techniques. Relying on a layer of graph features
knowledge which captures metadata and summary features of the underlying graph database, we presented different mechanisms to effectively prune the search space and achieve efficient and scalable performance for graph queries. Moreover, an effective selectivity-aware decomposition mechanism has been presented to tackle the join problem of large subgraph queries. Finally, we conducted an extensive set of experiments on real and synthetic datasets to demonstrate the efficiency and the scalability of our techniques. The results show that our purely relational approach for processing graph queries deserves to be pursued further. In the future, we will conduct further experiments using our approach with other types of graphs and will explore extending our querying mechanisms to handle other types of graph queries such as similarity queries and reachability queries. For example, adjusting our approach to handle the general subgraph isomorphism problem can be achieved by dropping the filtering predicates on the labels of the query vertices and edges in our SQL templates. Clearly, this will reduce the pruning power of the filtering phase and will increase the number of candidate graphs and the query response time. It is expected that addressing this problem effectively would require adding more sophisticated components to the graph descriptors as well as introducing new summary components to the graph features knowledge layer. In general, different types of graph queries differ in their nature; hence, the components of the GFK will be subject to adjustment based on the target query types. However, the aggregate graph descriptors and aggregate descriptors components are two basic components that are expected to be utilized to support efficient evaluation of different types of queries.

Acknowledgments

The authors would like to thank Dr. Xifeng Yan and Prof. Jiawei Han for providing GIndex, and Dr. Shijie Zhang, Mr. Meng Hu and Dr.
Jiong Yang for providing TreePi.

References

[1] Manola F, Miller E. RDF primer: World Wide Web Consortium proposed recommendation. February 2004. http://www.w3.org/TR/rdfprimer/.
[2] Cai D, Shao Z, He X, Yan X, Han J. Community mining from multi-relational networks. In Proc. the 9th European Conference on Principles and Practice of Knowledge Discovery in Databases, Porto, Portugal, Oct. 3-7, 2005, pp.445-452.
[3] Yang Q, Sze S. Path matching and graph matching in biological networks. Journal of Computational Biology, 2007, 14(1): 56-67.
[4] Huan J, Wang W, Bandyopadhyay D, Snoeyink J, Prins J, Tropsha A. Mining protein family specific residue packing patterns from protein structure graphs. In Proc. the Eighth Annual International Conference on Computational Molecular Biology, San Diego, USA, Mar. 27-31, 2004, pp.308-315.
J. Comput. Sci. & Technol., Nov. 2010, Vol.25, No.6
[5] Klinger S, Austin J. Chemical similarity searching using a neural graph matcher. In Proc. the 13th European Symposium on Artificial Neural Networks (ESANN), Bruges, Belgium, Apr. 27-29, 2005, pp.479-484.
[6] Willett P, Barnard J, Downs G. Chemical similarity searching. Journal of Chemical Information and Computer Sciences, 1998, 38(6): 983-996.
[7] Sakr S, Awad A. A framework for querying graph-based business process models. In Proc. the 19th International World Wide Web Conference (WWW), Raleigh, USA, Apr. 26-30, 2010, pp.1297-1300.
[8] Cheng J, Ke Y, Ng W, Lu A. FG-Index: Towards verification-free query processing on graph databases. In Proc. the ACM SIGMOD International Conference on Management of Data, San Diego, USA, Aug. 5-9, 2007, pp.857-872.
[9] Giugno R, Shasha D. GraphGrep: A fast and universal method for querying graphs. In Proc. the IEEE International Conference on Pattern Recognition (ICPR), Quebec, Canada, Aug. 11-15, 2002, pp.112-115.
[10] Jiang H, Wang H, Yu P, Zhou S. A novel approach for efficient search in graph databases. In Proc. the 23rd International Conference on Data Engineering (ICDE), Istanbul, Turkey, Apr. 15-20, 2007, pp.566-575.
[11] Williams D, Huan J, Wang W. Graph database indexing using structured graph decomposition. In Proc. the 23rd International Conference on Data Engineering (ICDE), Istanbul, Turkey, Apr. 15-20, 2007, pp.976-985.
[12] Yan X, Yu P, Han J. Graph indexing: A frequent structure based approach. In Proc. the ACM SIGMOD International Conference on Management of Data, Los Angeles, USA, Aug. 8-12, 2004, pp.335-346.
[13] Zhang S, Hu M, Yang J. TreePi: A novel graph indexing method. In Proc. the 23rd International Conference on Data Engineering (ICDE), Istanbul, Turkey, Apr. 15-20, 2007, pp.966-975.
[14] Zou L, Chen L, Yu J, Lu Y. A novel spectral coding in a large graph database. In Proc. the 11th International Conference on Extending Database Technology (EDBT), Nantes, France, Mar. 25-29, 2008, pp.181-192.
[15] Chen C, Yan X, Yu P, Han J, Zhang D, Gu X. Towards graph containment search and indexing. In Proc. the 33rd International Conference on Very Large Data Bases (VLDB), Vienna, Austria, Sept. 23-27, 2007, pp.926-937.
[16] Zhang S, Li J, Gao H, Zou Z. A novel approach for efficient supergraph query processing on graph databases. In Proc. the 12th International Conference on Extending Database Technology (EDBT), Saint-Petersburg, Russia, Mar. 24-26, 2009, pp.204-215.
[17] Tian Y, Patel J. TALE: A tool for approximate large graph matching. In Proc. the 24th International Conference on Data Engineering (ICDE), Cancun, Mexico, Apr. 7-12, 2008, pp.963-972.
[18] Yan X, Yu P, Han J. Substructure similarity search in graph databases. In Proc. the ACM SIGMOD International Conference on Management of Data, Los Angeles, USA, Jul. 31-Aug. 4, 2005, pp.766-777.
[19] Ke Y, Cheng J, Ng W. Efficient correlation search from graph databases. IEEE Transactions on Knowledge and Data Engineering, 2008, 20(12): 1601-1615.
[20] Zou L, Chen L, Lu Y. Top-K correlation sub-graph search in graph databases. In Proc. the International Conference on Database Systems for Advanced Applications (DASFAA), Brisbane, Australia, Apr. 21-23, 2009, pp.168-185.
[21] Cohen S, Hurley P, Schulz K et al. Scientific formats for object-relational database systems: A study of suitability and performance. SIGMOD Record, 2006, 35(2): 10-15.
[22] Botea V, Mallett D, Nascimento M, Sander J. PIST: An efficient and practical indexing technique for historical
spatio-temporal point data. GeoInformatica, 2008, 12(2): 143-168.
[23] Grust T, Sakr S, Teubner J. XQuery on SQL hosts. In Proc. the 30th International Conference on Very Large Data Bases (VLDB), Toronto, Canada, Aug. 31-Sept. 3, 2004, pp.252-263.
[24] Grust T, Mayr M, Rittinger J et al. A SQL:1999 code generator for the Pathfinder XQuery compiler. In Proc. the 26th ACM SIGMOD International Conference on Management of Data, San Diego, USA, Aug. 5-9, 2007, pp.1162-1164.
[25] Sakr S. Algebraic-based XQuery cardinality estimation. International Journal of Web Information Systems (IJWIS), 2008, 4(1): 7-46.
[26] Teubner J, Grust T, Maneth S, Sakr S. Dependable cardinality forecasts for XQuery. Proceedings of the VLDB Endowment (PVLDB), 2008, 1(1): 463-477.
[27] Graefe G. Sorting and indexing with partitioned B-trees. In Proc. the 1st International Conference on Data Systems Research (CIDR), Asilomar, USA, Jan. 5-8, 2003.
[28] Grust T, Rittinger J, Teubner J. Why off-the-shelf RDBMSs are better at XPath than you might expect. In Proc. the 26th ACM SIGMOD International Conference on Management of Data, San Diego, USA, Aug. 5-9, 2007, pp.949-958.
[29] Bruno N, Chaudhuri S, Ramamurthy R. Power hints for query optimization. In Proc. the 25th International Conference on Data Engineering (ICDE), Shanghai, China, Mar. 29-Apr. 2, 2009, pp.469-480.
[30] Florescu D, Kossmann D. Storing and querying XML data using an RDBMS. IEEE Data Engineering Bulletin, 1999, 22(3): 27-34.
[31] Sakr S. Storing and querying graph data using efficient relational processing techniques. In Proc. the 3rd International United Information Systems Conference (UNISCON), Sydney, Australia, Apr. 21-24, 2009, pp.379-392.
[32] Beyer K, Haas P, Reinwald B et al. On synopses for distinct-value estimation under multiset operations. In Proc. the ACM SIGMOD International Conference on Management of Data, San Diego, USA, Aug. 5-9, 2007, pp.199-210.
[33] Chakkappen S, Cruanes T, Dageville B, Jiang L, Shaft U, Su H, Zait M. Efficient and scalable statistics gathering for large databases in Oracle 11g. In Proc. the ACM SIGMOD International Conference on Management of Data, Los Angeles, USA, Aug. 11-15, 2008, pp.1053-1064.
[34] Graefe G, Fayyad U, Chaudhuri S. On the efficient gathering of sufficient statistics for classification from large SQL databases. In Proc. the ACM SIGKDD Conference on Knowledge Discovery and Data Mining (KDD), New York City, USA, Aug. 27-31, 1998, pp.204-208.
[35] Goldman R, Widom J. Enabling query formulation and optimization in semistructured databases. In Proc. the 23rd International Conference on Very Large Data Bases (VLDB), Athens, Greece, Aug. 25-29, 1997, pp.436-445.
[36] Baeza-Yates R, Ribeiro-Neto B. Modern Information Retrieval. ACM Press/Addison-Wesley, 1999.
[37] Aboulnaga A, Alameldeen A, Naughton J. Estimating the selectivity of XML path expressions for Internet scale applications. In Proc. the 27th International Conference on Very Large Data Bases (VLDB), Rome, Italy, Sept. 11-14, 2001, pp.591-600.
[38] Graefe G. Query evaluation techniques for large databases. ACM Computing Surveys, 1993, 25(2): 73-170.
[39] Agrawal S, Narasayya V, Yang B. Integrating vertical and horizontal partitioning into automated physical database design. In Proc. the ACM SIGMOD International Conference on Management of Data, Toronto, Canada, Aug. 31-Sept. 3, 2004, pp.359-370.
[40] Agrawal S, Chaudhuri S, Narasayya V. Automated selection of materialized views and indexes in SQL databases. In Proc. the 26th International Conference on Very Large Data Bases (VLDB), Cairo, Egypt, Sept. 10-14, 2000, pp.496-505.
[41] Agrawal S, Chu E, Narasayya V. Automatic physical design tuning: Workload as a sequence. In Proc. the ACM SIGMOD International Conference on Management of Data, Chicago, USA, Jun. 26-29, 2006, pp.683-694.
[42] Developmental Therapeutics Program. NCI/NIH. http://dtp.nci.nih.gov/.
[43] Kuramochi M, Karypis G. Frequent subgraph discovery. In Proc. the IEEE International Conference on Data Mining (ICDM), San Jose, USA, Nov. 29-Dec. 2, 2001, pp.313-320.
[44] Sakr S, Al-Naymat G. Graph indexing and querying: A review. International Journal of Web Information Systems (IJWIS), 2010, 6(2): 101-120.
[45] Sakr S. GraphREL: A decomposition-based and selectivity-aware relational framework for processing sub-graph queries. In Proc. the 14th International Conference on Database Systems for Advanced Applications (DASFAA), Brisbane, Australia, Apr. 21-23, 2009, pp.123-137.
[46] Zhao P, Yu J, Yu P. Graph indexing: Tree + delta graph. In Proc. the 33rd International Conference on Very Large Data Bases (VLDB), Vienna, Austria, Sept. 23-27, 2007, pp.938-949.
[47] He H, Singh A. Closure-tree: An index structure for graph queries. In Proc. the 22nd International Conference on Data Engineering (ICDE), Atlanta, USA, Apr. 3-8, 2006, pp.38-52.
[48] Guttman A. R-trees: A dynamic index structure for spatial searching. In Proc. the ACM SIGMOD International Conference on Management of Data, Minneapolis, USA, Jul. 23-27, 1984, pp.47-57.
Sherif Sakr is a research scientist in the Managing Complexity Group at National ICT Australia (NICTA). He is also a conjoint lecturer in the School of Computer Science and Engineering (CSE) at the University of New South Wales (UNSW), Australia. He received his Ph.D. degree in computer science from Konstanz University, Germany, in 2007. His research interests are data and information management in general, particularly indexing techniques, query processing and optimization techniques, graph data management, and large-scale data management in cloud platforms. His work has been published in international journals and conferences such as PVLDB, SIGMOD, JCSS, JDM, IJWIS, WWW, BPM, TPCTC, DASFAA and DEXA. One of his papers was awarded the Outstanding Paper Excellence Award 2009 of the Emerald Literati Network.

Ghazi Al-Naymat is a postdoctoral research fellow at the School of Computer Science and Engineering at the University of New South Wales, Australia. He received his Ph.D. degree in May 2009 from the School of Information Technologies at the University of Sydney, Australia. Dr. Al-Naymat's research focuses on developing novel data mining techniques for different applications and datasets such as graph, spatial, spatio-temporal, and time series databases. He has published a number of papers in international journals and conferences.