ANSWERING GRAPH PATTERN QUERY USING INCREMENTAL VIEWS A DISSERTATION submitted in partial fulfilment of the requirements for the award of the degree of Master of Technology in COMPUTER ENGINEERING
BY
KOMAL SINGH Roll No. 3145524 Under the supervision of
Mr. VIKRAM SINGH Asst. Professor
DEPARTMENT OF COMPUTER ENGINEERING NATIONAL INSTITUTE OF TECHNOLOGY KURUKSHETRA-136119, HARYANA (INDIA) JUNE, 2016
Department of Computer Engineering National Institute of Technology Kurukshetra-136119, Haryana, India
CERTIFICATE
I hereby certify that the work which is being presented in the M. Tech. Dissertation entitled "Answering Graph Pattern Query using Incremental Views", in partial fulfillment of the requirements for the award of the Master of Technology in Computer Engineering is an authentic record of my own work carried out during a period from June 2015 to June 2016 under the supervision of Mr. Vikram Singh, Assistant Professor, Computer Engineering Department. The matter presented in this thesis has not been submitted for the award of any other degree elsewhere.
Date: Place: Kurukshetra
Komal Singh Roll No. 3145524
This is to certify that the above statement made by the candidate is correct to the best of my knowledge.
Date: Place: Kurukshetra
Mr. Vikram Singh Assistant Professor Department of Computer Engineering National Institute of Technology Kurukshetra-136119 (Haryana)
ACKNOWLEDGEMENT
I would like to express my sincere gratitude to my supervisor, Mr. Vikram Singh, Assistant Professor, Computer Engineering Department, for his invaluable guidance and support during the research work. He introduced me to the area of databases and helped me to understand the principles of good research from every perspective. His self-discipline and hard work influenced me to keep this thesis on track. Moreover, he was a great source of motivation. I would also like to thank our respected H.O.D., Dr. A. K. Singh. Many thanks to Dr. Syed Taqi Ali, Coordinator, M. Tech Dissertation Course, for his help, valuable suggestions and advice while evaluating our work from time to time. Thanks must also go to the M. Tech Dissertation Evaluation Team, Dr. S. K. Jain, Dr. Mayank Dave and Mr. Virendra Ranga, for sharing their valuable time. Their encouraging discussions and positive suggestions helped me greatly in conducting efficient research. I would also like to thank my colleagues, Akshay, Aarzoo, Ramya and Vishal, for their useful talks, arguments and beneficial discussions, and especially for listening to my problems. I am filled with gratitude to them for all the good times. Finally, on a personal note, I would like to express my profound gratitude to my parents for their unconditional love and continuous encouragement throughout my years of study. This accomplishment would not have been possible without them.
Komal Singh
ABSTRACT
In recent years, modeling data as graphs has proved effective for processing in prominent application areas such as social analytics, health care analytics and scientific analytics. Much of this data is archived and analyzed in graph-structured databases, as a collection of operational data objects mapped into a labeled graph or a set of labeled directed graphs. A graph database is mapped into a data graph to simplify specialized computing or processing tasks. A user query also needs to be mapped into a graph, known as a query pattern graph, which is constructed by creating the nodes the user requires and connecting them based on links/relationships. To answer a query graph, it is first necessary to find all the candidate matches in the data graph and then to improve the precision of the retrieved results. Graph structures are pervasive in large-scale analytics and face new challenges such as data size, heterogeneity, uncertainty and data quality. Graph matching is a key problem in a variety of emerging applications, as it determines the similarity between two graph patterns. The primary objective of graph pattern matching (GPM) is to determine all the candidate matches of a query pattern on a data graph. Traditional GPM approaches rely on isomorphism and simulation for pattern matching.
In this dissertation, the performance of the traditional notions of graph pattern matching is first analyzed, and it is observed that, for real-life applications, many of them are unable to capture structure or semantics. Moreover, real-life data graphs are constantly modified by frequent, small updates, so it is challenging to find all matches for a user query with high accuracy over complex graphs. In response to these challenges, an algorithm is devised that revises the traditional notions to characterize graph pattern matching using views. The algorithm finds all the candidate matches of a user query on dynamically updated views, called Incremental Views. Each update to the data graph leads to modifications of the incremental view set, and the algorithm also helps to identify the affected views. Hence, the matches for a user query always reflect correct and current data values. Based on this characterization, our approach can efficiently solve the graph pattern query problem over both static and dynamic real-life data graphs.
TABLE OF CONTENTS
List of Figures
List of Tables
1. Introduction
1.1 Pattern Mining and Graph Mining
1.2 Graph Pattern Matching
1.3 Related Work
1.4 Motivation
1.5 Aim
1.6 Thesis Organization
2. Literature Survey
2.1 Preliminaries
2.2 Graph Pattern Matching Example
2.3 Graph Pattern Matching Approaches
2.3.1 Structure based approaches
2.3.2 Semantic based approaches
2.3.3 View based approach
2.4 Performance Statistics
2.5 Challenges in Graph Pattern Matching
3. Answering Pattern Query using Incremental Views
3.1 Graph Pattern Matching using Views
3.2 Graph Pattern Matching using Incremental Views
3.3 Proposed Algorithm
3.3.1 Example
4. Experimental Analysis
4.1 Experimental Result
5. Conclusion
Listed Challenges and Open Issues
References
List of Publications
LIST OF FIGURES
Figure 2.1: Data Graph of COMPANY Database
Figure 2.2: Referential Integrity Constraint on COMPANY schema
Figure 2.3: Pattern Query
Figure 2.4: Match from COMPANY Database
Figure 2.5: Classification of Graph Pattern Matching Approaches
Figure 3.1: Proposed approach of Answering Pattern Query
Figure 3.2: Data Graph 'G', Pattern query 'P' and View set 'V'
Figure 3.3: Match set Se and distance matrix DM' on view set V'
Figure 3.4: Resultant match 'M'
Figure 3.5: Updated match set Se' and resultant match M'
Figure 4.1: Data Graph of Company database (drawn in Neo4j interface)
Figure 4.2: Materialized view 1
Figure 4.3: Materialized view 2
Figure 4.4: Materialized view 3
Figure 4.5: Materialized view 4
Figure 4.6: Effect of Pattern Query 'P' size on Execution time
Figure 4.7: Effect of number of edges deleted on Execution time
Figure 4.8: List of affected views when edge deleted
LIST OF TABLES
Table 2.1: COMPANY Database
Table 2.2: Summary of Graph Pattern Matching Approaches
Table 2.3: Algorithmic Complexities of GPM approaches
Table 3.1: Notation used in flowchart of algorithm
Chapter 1 INTRODUCTION
Computerization has substantially enhanced the capability to generate and collect data from diverse sources. Many real-life applications generate data at massive scale, and this data now flows from almost every aspect of daily life. Key sources include peta-scale simulations, the World Wide Web, scientific applications, social media, and experimental sensors and devices. Similarly, traditional engineering and scientific practices generate large volumes of data from scientific experiments, engineering observations, system performance measurement, remote sensing and environmental surveillance. Businesses worldwide generate enormous data sets, including product descriptions, stock trading records, sales transactions, sales promotions, company profiles and customer feedback [1]. The fast development of powerful data collection and storage tools has also contributed to the explosive growth of available data. This growth has created an urgent need for new automated tools and techniques that can intelligently analyze the data and transform it into useful information and organized knowledge. The extraction of patterns representing knowledge stored in large databases or other massive information repositories is known as knowledge discovery [1]. Knowledge discovery is the whole process of developing tools and techniques whose purpose is to make sense of collected data [2]. The process has several steps. It starts with cleaning the datasets to remove noise and inconsistent data. The next step is integration, where data from multiple sources is combined. Relevant data is then selected from the database, and this data is transformed and consolidated into forms appropriate for mining, which may involve performing aggregation operations. Data mining is the next step, where data patterns are extracted by applying intelligent methods. Truly interesting patterns representing knowledge are then identified from the extracted patterns, and finally these interesting patterns are presented to the user, a step known as knowledge presentation. Data mining is a crucial step in the process of knowledge discovery [1]. There are two major data mining goals, determined by the goal of the application: verification and discovery [2]. Verification checks the user's hypothesis about the data, while discovery automatically finds interesting patterns. To support advanced data analysis and web-based databases, database technology has moved towards advanced database systems and data mining. Data collected in large repositories is seldom used. In such cases, data mining tools work as an expert system and knowledge-based technology that narrows the widening gap between data and information by converting data tombs into golden nuggets of knowledge [3].
1.1. Pattern Mining and Graph Mining
Data mining techniques can be applied to any kind of data as long as the data is meaningful for the target application. Data mining functionalities are used to specify the kinds of patterns to be determined in data mining tasks. Pattern mining consists mainly of data mining algorithms that discover interesting and useful patterns in a database. Pattern mining algorithms can be applied to various types of data such as transaction databases, graph databases, sequence databases, spatial data, streams and strings [1]. A pattern mining application aims to discover patterns in the form of subgraphs, sequential patterns, periodic patterns, associations, indirect associations, lattices, trends, rules, etc. The knowledge to be mined may take many forms, from periodic patterns of transactions to complicated structural patterns of interrelated transactions. To extract such knowledge, the data must be represented in a form that not only captures the relational information but also supports efficient and effective mining of the data and comprehensibility of the resulting knowledge. Graphs are used for this purpose because they support all aspects of the relational data mining process [4]. Using a graph to represent the data and the mined knowledge supports direct visualization and increased comprehensibility of the knowledge [5].
A graph represents a set of objects as vertices/nodes, where some pairs of nodes are connected by direct links in the form of edges/relationships. A graph is therefore a powerful tool for modeling database objects and the relationships among data items in various application domains [4]. Drawing useful knowledge out of such a data graph is known as graph mining. Finding a pattern or structure is often a core activity of knowledge mining on a data graph. A pattern or structure is a subgraph of the graph data derived by applying data mining techniques. A graph mining approach differs from general data mining in that it retains both the data and the structural information among the data objects in the data graph [5]. Structural information is helpful because it represents the relationships among graph entities. Typically, graph mining applies structural knowledge to subgraphs to derive the resultant graph. Graph mining is broadly a multi-step, iterative process. Graph matching is one method of assessing the importance of a subgraph by evaluating structural equivalence [4].
1.2. Graph Pattern Matching
A tremendous amount of data, measured in terabytes and petabytes, is generated daily from almost every aspect of life, including society, business, medicine, science and engineering. Modeling data as graphs has proved effective for processing in these application areas because a graph is a powerful tool for representing and understanding objects that are inter-related [4]. Hence, as far as processing and analytics are concerned, real-life graphs are everywhere. Graph databases, which store data with an explicit graph structure, have therefore become more popular. A data graph is a collection of data items stored in a graph database, with the data held in the nodes and relationships of the graph. Nodes represent the actual data and relationships represent the connections among these nodes. For data of any significant volume or value, a graph database is a well-suited way of modeling. The prime challenge in a graph database is to understand the semantic relationships among its constituent elements. When a graph database is initialized, index-free adjacency is preferred, in which connected nodes are physically linked to each other [6].
In most application areas, one may want to find the relationship between two objects that are both represented by different graph models [4]. In graph database models, the data structures for the database schema and instances are modeled as graphs or generalizations of graphs [4]. These graph models try to overcome the inherent limitations of traditional data models, since interconnectivity among data items is a vital facet. The development of graph models provides an opportunity to model related problems in high performance computing [4]. Research in this direction has mainly addressed two types of graph databases. The first type consists of very large graphs, such as the Web graph and social networks. Queries on such graph databases include finding the best connection between a given set of nodes and finding subgraphs that match a given query pattern [7] [8] [9]. The second type consists of a large set of small graphs, as found in chemical compound databases [10] and bioinformatics [11]. Queries on this type of graph database typically include subgraph queries and similarity queries. A subgraph query retrieves all graphs in the database that are supergraphs of a given query graph [7][12][13][14], while a similarity query retrieves all graphs that are structurally similar to a given query graph [12][13]. Query processing is one of the core activities in data processing and analytics. However, the performance of query processing on graph databases is still inadequate because of the high complexity of processing graph data. As a result, it is important to design efficient algorithms for answering pattern queries on graph databases. In this thesis, an algorithm is derived for answering pattern queries using traditional notions related to graph patterns. The algorithm is designed to answer a pattern query without accessing the data graph, using materialized views instead.
The process of evaluating the structural similarity of graphs is referred to as graph matching. Graphs and graph matching algorithms serve as a powerful tool for search and comparison because of their efficiency and ease of use as a representation [15][16]. A graph pattern matching (GPM) approach finds the subgraphs of a data graph that match a query graph (pattern) [1]. Graph matching is a complex problem, as various levels of graph modeling are involved. It has become one of the central problems in a diversity of emerging domains such as computer vision, artificial intelligence, computer aided design, information retrieval, data mining, knowledge discovery, mathematical graph theory, biology and electronics. The research effort devoted to pattern matching in the past few years goes beyond its application in computer science and spans various other research communities [7][4][17]. Graph matching systems can roughly be divided into systems that match structure exactly and systems that match structure in an error-tolerant way. Although exact matching offers a rigorous way to describe the graph matching problem in mathematical terms, it is generally applicable only to a restricted set of real-world problems. Error-tolerant graph matching is able to cope with strong inner-class distortion, which is often present in real-world problems, but is generally less efficient computationally. The objective of graph pattern matching is to determine the instances or occurrences of an explicit pattern in a data graph, whereas the primary objective of graph mining is to determine a set of the most common or most "interesting" patterns in a graph. Prominent application areas of graph pattern matching include finding patterns of interest to scientists in biological networks (for instance, protein-protein interaction networks), finding research collaboration information (such as citation link analysis) from bibliographic data, source code analysis, and finding relationships and their proximity between people in a social network. A substantial amount of research has been done in many application areas of graph pattern mining. Many algorithms have been proposed based on the notion of subgraph isomorphism for frequent subgraph mining, exact pattern matching and distance-based pattern matching with wildcards [18]. Graph pattern matching is widely used to search and analyze, e.g., social graphs, biological data and transportation networks [19].
1.3. Related Work
Because graph characteristics and application requirements vary, graph matching is not a single problem but a set of related problems. Graph pattern matching is proving useful in a wide range of areas [9][17][20]. We categorize some of the fundamental works according to the approach they take, as follows:
Subgraph Isomorphism and Homomorphism. All the basic graph matching approaches are defined in terms of subgraph isomorphism; an approach either finds the isomorphic subgraph matches for the given input pattern, or returns inexact isomorphic subgraph matches with the best matched nodes [21]. Subgraph isomorphism can be determined by means of a brute-force tree-search enumeration procedure [18]. As the problem is intractable, approximate solutions that find inexact matches have also been designed [7][22]. The work in [23] presents an algorithm for subgraph homomorphism. Instead of finding a homomorphic image of the whole pattern graph in a data graph, it decomposes the pattern matching problem along the connected components of the pattern graph and finds a homomorphic image of every connected component of the pattern in the data graph. Extensions of subgraph isomorphism also allow edge-to-path mappings. One extension is designed for XML schema mapping [24] and another for Web site matching [25], but both are still NP-complete. The algorithm in [26] processes graph pattern matching as a sequence of reachability joins, also known as R-joins, on a graph database. It stores the data graph in tables and uses a cluster-based join index with graph codes for reachability checking. The authors also provide an approach for optimizing a sequence of R-joins.
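The brute-force tree-search enumeration mentioned above can be sketched as a simple backtracking procedure. This is only an illustrative sketch, not the procedure of [18]: the function name `subgraph_matches`, the `compatible` callback and the toy node names are assumptions made for this example, and the sketch checks the non-induced variant (every pattern edge must be present in the data graph, and the mapping must be injective).

```python
from typing import Callable, Dict, Hashable, Iterable, Iterator, Set, Tuple

Node = Hashable
Edge = Tuple[Node, Node]


def subgraph_matches(
    p_nodes: Iterable[Node],
    p_edges: Iterable[Edge],
    g_nodes: Iterable[Node],
    g_edges: Iterable[Edge],
    compatible: Callable[[Node, Node], bool],
) -> Iterator[Dict[Node, Node]]:
    """Yield every injective mapping pattern node -> data node that
    preserves all pattern edges (hypothetical helper for illustration)."""
    order = list(p_nodes)
    data_nodes = list(g_nodes)
    pattern_edges: Set[Edge] = set(p_edges)
    data_edges: Set[Edge] = set(g_edges)

    def consistent(u: Node, v: Node, mapping: Dict[Node, Node]) -> bool:
        # Check every pattern edge between u and the nodes mapped so far.
        for a, b in pattern_edges:
            if a == u and b == u:
                if (v, v) not in data_edges:
                    return False
            elif a == u and b in mapping:
                if (v, mapping[b]) not in data_edges:
                    return False
            elif b == u and a in mapping:
                if (mapping[a], v) not in data_edges:
                    return False
        return True

    def extend(depth: int, mapping: Dict[Node, Node], used: Set[Node]):
        if depth == len(order):
            yield dict(mapping)  # one complete match
            return
        u = order[depth]
        for v in data_nodes:
            if v in used or not compatible(u, v):
                continue
            if consistent(u, v, mapping):
                mapping[u] = v
                used.add(v)
                yield from extend(depth + 1, mapping, used)
                del mapping[u]  # backtrack
                used.remove(v)

    yield from extend(0, {}, set())


# Toy example: pattern A -> B, data nodes labeled by their first letter.
labels = {"a1": "A", "a2": "A", "b1": "B", "b2": "B"}
found = list(subgraph_matches(
    ["A", "B"], [("A", "B")],
    ["a1", "a2", "b1", "b2"], [("a1", "b1"), ("a1", "b2")],
    lambda u, v: labels[v] == u,
))
```

In the worst case the search tree grows exponentially with the number of pattern nodes, which is in line with the intractability noted above.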
Graph Simulation. Practical application areas of graph pattern matching based on graph simulation include web site classification [8], structural indexes [27] and process calculus [28]. An algorithm for computing graph simulation on a single graph was proposed in [29]. For reactive systems, an algorithm for computing similarity relations of labeled graphs based on simulation is provided in [19]. For infinite graphs, a symbolic-checking procedure is provided in [29] that terminates if a finite similarity relation exists. An extension of graph simulation incorporates regular expressions as constraints on the edges of the pattern graph [30]. Approximate matching has also been studied, which finds inexact solutions by allowing node/edge mismatches [30]. Another extension, the notion of weak similarity [19], is quite similar to bounded simulation and focuses on subgraph similarity. It is also an NP-complete problem. The edge-to-edge mappings of subgraph isomorphism and graph simulation cannot specify an association between a pair of nodes via a path of unpredictable length in a data graph, because they impose topological constraints too strictly. For these reasons, the traditional notions defined for graph pattern matching have to be revised to identify reasonable matches in real-life graphs precisely and effectively [31][32]. A polynomial-time notion known as bounded simulation was proposed to capture patterns commonly found in practice by allowing bounds on the number of hops in them [31]. The notions of simulation in [33][30][29] fail to capture the topology of graphs and may yield false matches or an excessively large match relation. The work in [31] rectifies this problem by imposing additional constraints on graph simulation. The work in [8] provides a heuristic approach for computing pattern matches. It also surveys the notions and techniques used in the automatic generation of application-specific heuristics that may be integrated into an extensible, general-purpose pattern matching algorithm.
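To make the notion of graph simulation concrete, the following sketch computes a simulation relation by repeatedly discarding data nodes that cannot follow some pattern edge. It is a naive fixpoint written for illustration, not the algorithm of [29] or of any other cited work; the function name `graph_simulation`, the `compatible` callback and the toy graph are assumptions.

```python
from typing import Callable, Dict, Hashable, Iterable, Set, Tuple

Node = Hashable
Edge = Tuple[Node, Node]


def graph_simulation(
    p_nodes: Iterable[Node],
    p_edges: Iterable[Edge],
    g_nodes: Iterable[Node],
    g_edges: Iterable[Edge],
    compatible: Callable[[Node, Node], bool],
) -> Dict[Node, Set[Node]]:
    """Return {pattern node: set of matching data nodes}, or {} when some
    pattern node has no match (hypothetical helper for illustration)."""
    data_nodes = list(g_nodes)
    successors: Dict[Node, Set[Node]] = {v: set() for v in data_nodes}
    for v, w in g_edges:
        successors[v].add(w)
    pattern_edges = list(p_edges)

    # Start from all label-compatible candidates for every pattern node.
    sim = {u: {v for v in data_nodes if compatible(u, v)} for u in p_nodes}

    changed = True
    while changed:
        changed = False
        for u, u_child in pattern_edges:
            # v stays a match of u only if some successor of v matches u_child.
            keep = {v for v in sim[u] if successors[v] & sim[u_child]}
            if len(keep) < len(sim[u]):
                sim[u] = keep
                changed = True

    if any(not matches for matches in sim.values()):
        return {}
    return sim


# Toy example: the pattern is a 2-cycle A <-> B, the data graph a 4-cycle
# a1 -> b1 -> a2 -> b2 -> a1 plus a dangling node a3 labeled A.
labels = {"a1": "A", "a2": "A", "a3": "A", "b1": "B", "b2": "B"}
relation = graph_simulation(
    ["A", "B"], [("A", "B"), ("B", "A")],
    list(labels), [("a1", "b1"), ("b1", "a2"), ("a2", "b2"), ("b2", "a1")],
    lambda u, v: labels[v] == u,
)
```

In this toy example the data graph contains no 2-cycle, so an isomorphism-based matcher would find nothing, while simulation relates A to a1 and a2 and B to b1 and b2. This illustrates why simulation is the more permissive notion and why it can relate more nodes than a strict topological match would.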
Incremental based. Incremental algorithms have been developed for various applications where data graphs are dynamic in nature [34][35]. Unlike batch simulation, the complexity of an incremental algorithm does not depend on the size of the entire input; it depends on the size of the area affected by the updates. A notion of semi-boundedness was proposed in [34]. For social graphs with millions of nodes and edges, it is infeasible to conduct bounded simulation because of its cubic-time complexity. To overcome this, incremental graph pattern matching was introduced in [9] for computing exact matches in social graphs. Incremental algorithms have also been provided for shortest-path problems [35][36] and for bi-simulation [37][38]. Approximate algorithms based on the incremental notion have been studied for subgraph search [39]; there, the database is provided as a graph stream and the algorithm determines whether or not a pattern is contained in it.
View based. Query processing based on views can be applied in two ways: query rewriting and query answering [40][41][42]. In both, a user query in the form of a graph and a set of views are given. Query rewriting reformulates the user query into another query that is equivalent to the user query and refers only to the views in the view set, while query answering finds another existing query that is equivalent to the user query and refers only to the views in the view set. The main difference between them is that the former requires the reformulated query to be in a fixed language, while the latter imposes no constraints on the equivalent query. A query answering approach for graph pattern matching based on the notion of bounded simulation is provided in [43].
Distributed graph simulation. Numerous graph systems have been developed for storing and querying distributed graphs [4][44][45][46], for example Facebook TAO [46] and Microsoft Trinity [4]. Several of these systems give performance guarantees via partial evaluation or message passing. Pregel [45] is a distributed graph system based on synchronized message passing, while GraphLab [44] is an asynchronous parallel-computation framework for graphs, optimized for scalable machine learning and data mining algorithms. It is hard to assure provable performance bounds in these frameworks, especially for graph pattern matching on arbitrarily partitioned graphs. The algorithms in [47][48] also study distributed graph simulation by scheduling inter-site message passing. These algorithms are the first effort to integrate partial evaluation and message passing for graph pattern matching. They provide provable performance guarantees on data shipment and response time by reducing query processing time. A distributed graph pattern matching approach for large social graphs has also been developed in [9].
1.4. Motivation
Graphs designed for real-life applications consist of millions of nodes and edges, so traditional matching algorithms may be inefficient at producing results on such huge graphs. These approaches may consume a great deal of space and time, because the entire graph has to be loaded into primary memory for each processing instance. The notion of data views has been used proficiently to handle such complexity in data pre-processing.
As views are used efficiently for querying relational and semi-structured data, they can be adapted for querying graph data. Conventional application areas of views include data warehousing systems, data integration systems, access control and semantic caching. Views uphold scale independence, i.e. data can be queried independently of its size [24] [34]. Real-life graphs, especially social graphs, are typically large and often distributed. When combined with graph pattern matching, views can provide an effective method for querying large graph data. Answering graph queries using views computes the query result by evaluating an equivalent query that refers only to the views in the view set [9][22]. Views are usually created for the queries that database users issue frequently. A database designer creates views to minimize the compilation of frequent queries and thus to enhance reusability. Views on relational and semi-structured data are non-editable and virtual, and they are maintained on the data repository. For graph data, views can be exploited for answering a graph pattern query. Here, the view set is a set of frequent queries issued on the data graph by database users, and it is initialized together with the data graph during the initialization phase [24]. To conduct graph pattern matching by exploiting views, answers must first be found to the following questions.
(1) How can we decide whether a match for a query pattern graph can be found using the views defined on the data graph?
(2) If so, which views in the view set should be chosen to answer the pattern query?
(3) How can a match for a pattern be computed efficiently from the view set?
(4) How can an updated match be found when the data graph changes frequently?
This thesis is an effort towards finding a novel solution to the above questions.
1.5. Aim
This thesis makes two key contributions. The first is an overall comparative performance analysis of the graph pattern matching approaches that researchers have used for query processing in recent years, together with a discussion of the modern computing challenges that graph pattern matching faces in query processing. The second, and more important, is an examination of several questions relevant to answering graph pattern queries using graph views. For this purpose, some of the traditional notions of graph pattern matching, which are mainly defined in terms of simulation, are revisited, as discussed below:
(1) The notion of pattern containment is used to deal with the sheer size of the data graph. It extends the traditional notion of query containment to views. It states that the answer to a graph query can be found using a set of views defined on the data graph if and only if the pattern query is contained in the view set.
(2) To decide which views from the view set should be used to answer a pattern query, the minimum containment problem is identified for pattern containment. The minimum pattern containment problem is to find a minimum-size subset of views from the view set that contains the pattern query.
(3) To find a match for a pattern query, the notion of graph simulation is revised. Instead of finding matches in the data graph, matches are found in the view set by mapping the nodes and edges of the pattern to the view definitions.
(4) An incremental pattern matching approach is used to find new matches of the pattern in the updated data graph. It finds the new matches by identifying changes to the previously computed matches. Hence, once matches have been computed on the entire graph by a batch matching algorithm, this approach incrementally identifies changed or new matches in response to changes in the data graph.
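Points (1) and (2) can be illustrated with a small sketch. It assumes, purely for illustration, that each view has already been reduced to the set of pattern edges it is able to answer; how that set is derived is not specified in this section, so the `view_cover` dictionary is a hypothetical input. Under that assumption, containment becomes a coverage check, and a greedy heuristic selects a small subset of views. The greedy choice is a standard set-cover heuristic that is not guaranteed to be minimum, and it is not claimed to be the algorithm proposed in this thesis.

```python
from typing import Dict, Hashable, Iterable, List, Optional, Set, Tuple

Edge = Tuple[Hashable, Hashable]


def is_contained(pattern_edges: Iterable[Edge],
                 view_cover: Dict[str, Set[Edge]]) -> bool:
    """True if every pattern edge is covered by at least one view."""
    covered: Set[Edge] = set().union(*view_cover.values())
    return set(pattern_edges) <= covered


def greedy_view_selection(pattern_edges: Iterable[Edge],
                          view_cover: Dict[str, Set[Edge]]) -> Optional[List[str]]:
    """Pick views one at a time, always taking the view that covers the most
    still-uncovered pattern edges. Returns None if the pattern is not
    contained in the view set."""
    uncovered = set(pattern_edges)
    chosen: List[str] = []
    while uncovered:
        if not view_cover:
            return None
        # sorted() makes tie-breaking deterministic (alphabetical by name).
        best = max(sorted(view_cover),
                   key=lambda name: len(view_cover[name] & uncovered))
        gain = view_cover[best] & uncovered
        if not gain:
            return None  # no view covers the remaining edges
        chosen.append(best)
        uncovered -= gain
    return chosen


# Hypothetical pattern with three edges and four views (names are made up).
pattern = {("PM", "DBA"), ("PM", "PRG"), ("DBA", "PRG")}
views = {
    "V1": {("PM", "DBA")},
    "V2": {("PM", "DBA"), ("PM", "PRG")},
    "V3": {("DBA", "PRG")},
    "V4": {("PM", "PRG")},
}
selected = greedy_view_selection(pattern, views)
```

Here V2 is chosen first because it covers two edges, and V3 then covers the remaining edge, so two of the four views suffice to answer the pattern.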
1.6. Organization of Thesis
The remaining part of this thesis is structured as follows. Chapter 2 discusses the problem of graph pattern matching, including a classification of graph pattern matching approaches. The performance analyses of the various graph pattern matching approaches are summarized along with their algorithmic complexity. The chapter also highlights the inherent challenges of existing graph pattern matching approaches and closes with the motivation for this thesis. Chapter 3 first describes the concept of views in graph pattern matching. It also revisits the traditional notions of graph pattern matching and formally defines the notion of incremental views for them. It then describes the challenges in the traditional notions and how the notion of incremental views addresses them, with a working example and an algorithmic explanation. Chapter 4 provides the pseudo code of the algorithm along with its implementation details and the results of the experimental analysis. Chapter 5 concludes the thesis, briefly discusses the current challenges in the existing approaches and highlights some possible future directions for research.
Chapter 2
LITERATURE SURVEY
Graph pattern matching is a generalization of string matching and two-dimensional pattern matching. A user query entered in the form of a string is first converted into a two-dimensional pattern, and its match is then found in the database. Hence it offers a natural framework for the study of matching problems on multi-dimensional structures. Given a data graph G and a query pattern graph P, the graph pattern matching problem is to determine all the subgraphs of the data graph G that match the user query pattern graph P. In this thesis we consider directed and labeled graphs G = (V, E, ƒA) [23]. If we delete some nodes from graph G together with their incident edges, we obtain a subgraph G’ = (V’, E’, ƒ’A).
2.1. Preliminaries
Fundamental notations related to pattern graphs and matching are discussed below.
Data Graph: A database is represented in the form of a node-labeled directed graph known as a data graph. It is denoted by G = (V, E, ƒA), where V is the set of data graph nodes and E ⊆ V × V is the set of data graph edges, with (v, v’) denoting an edge from node v to node v’. The function ƒA(·) relates each node v in V to a tuple ƒA(v) = (A1 = a1, …, An = an), where ai is a constant and Ai is an attribute of v. Here, A represents a list of attributes, or a set of predicates, defined on the nodes of the graph.
Pattern Graph: A user query is represented in the form of a directed graph known as a pattern graph. It is denoted by P = (VP, EP, ƒv, ƒe), where VP is the set of pattern nodes and EP is the set of pattern edges; the function ƒv(·) is defined on VP such that for each node u ∊ VP, ƒv(u) is a predicate of u, and the function ƒe(·) is defined on EP.
Matching: Matching is defined as the problem of finding a subgraph of a data graph G which matches the pattern graph P. It is denoted by M(P,G).
Maximally Contained Graph: Consider a pattern query PS = (VP, EP, ƒv) and a view set 𝒱 = {V1, …, Vn}, where Vi = (Vi, Ei, ƒi) denotes the definition of view Vi. PS is contained in 𝒱, denoted by PS ⊆ 𝒱, if there exists a mapping 𝜆 from EP to the powerset 𝒫(∪i∊[1,n] Ei) such that, for all data graphs G, the match set Se ⊆ ∪e′∊𝜆(e) Se′ for all edges e ∊ EP.
Paths: A path is a sequence of nodes (v1, …, vn) such that (vi, vi+1) is an edge in G for each i ∊ [1, n-1].
Distance: The distance between a pair of nodes (u, v) in G is the length of the shortest path from u to v, denoted by dist(u, v).
Diameter: The diameter of G, denoted by dG, is the length of the longest shortest path over all pairs of nodes (u, v) in G, i.e., dG = max(dist(u, v)).
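The path, distance and diameter notions above can be illustrated with a minimal Python sketch. The toy graph, its labels and the function names are assumptions made for illustration only; distances follow directed edges, and unreachable pairs are simply ignored when taking the maximum.

```python
from collections import deque

# Hypothetical toy data graph: node -> attribute tuple, plus directed edges.
attrs = {"a": {"label": "Emp"}, "b": {"label": "Emp"},
         "c": {"label": "Proj"}, "d": {"label": "Dept"}}
edges = {("a", "b"), ("b", "c"), ("a", "c"), ("c", "d")}

def successors(edges):
    """Adjacency map: node -> set of direct successors."""
    adj = {}
    for v, w in edges:
        adj.setdefault(v, set()).add(w)
    return adj

def dist(edges, source):
    """Breadth-first search: hop distance from source to each reachable node."""
    adj = successors(edges)
    seen = {source: 0}
    queue = deque([source])
    while queue:
        v = queue.popleft()
        for w in adj.get(v, ()):
            if w not in seen:
                seen[w] = seen[v] + 1
                queue.append(w)
    return seen

def diameter(nodes, edges):
    """Longest finite shortest-path distance over all ordered node pairs."""
    return max(d for v in nodes for d in dist(edges, v).values())

distances_from_a = dist(edges, "a")
d_G = diameter(attrs, edges)
```

Here `dist(edges, "a")` plays the role of dist(u, v) for u = a, and `d_G` corresponds to dG for the toy graph.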
2.2. Graph Pattern Matching: Example
Consider a company database COMPANY that uses index adjacency for storage, i.e. the database is actually stored as a relational database and is represented in the form of a graph when it is queried. The COMPANY database is stored in the relational tables EMPLOYEE, DEPT, PROJECT and WORKS_ON, as shown in Table 2.1. EMPLOYEE contains the details of each employee: the name (Name), the employee id (Ssn), the employee id of the supervisor (Super_Ssn) and the id of the department in which the employee works (Dno). DEPT consists of the details of each department, with Dno as the primary attribute. PROJECT refers to the projects in the company, each uniquely identified by Pno. The table WORKS_ON specifies which employee, identified by Essn, is working on which project, identified by Pno.
Table 2.1: COMPANY Database

EMPLOYEE
Name      Ssn    Super_Ssn  Dno
John      12345  33455      5
Franklin  33455  88655      5
Alicia    99877  98765      4
Jennifer  98765  88655      4
Ramesh    66844  33455      5
Joyce     45345  33455      5
Ahmad     98798  98765      4
James     88655  Null       1

WORKS_ON
Essn   Pno
12345  1
12345  2
66844  3
45345  1
45345  2
33455  2
33455  3
33455  10
33455  20
99877  30
99877  10
98798  10
98798  30
98765  30
98765  20
88655  20

DEPT
Dno  DName
1    CS
4    CH
5    CE

PROJECT
Pno  PName
1    DBMS
2    ADA
3    RQE
10   TQP
20   TOC
30   OS
The database is represented in the form of a graph as shown in Figure 2.1. Here, only one label is defined on the nodes of the graph, i.e. Pno, as it is the one relevant to the user query at hand. A pattern query consisting of similar values is easily mapped onto the data graph. A data graph always consists of data values in its nodes, as per the summarized data values. The key constraints and integrity constraints are specified on individual relations. The referential integrity constraint representation of the database is shown in Figure 2.2. A referential integrity constraint specified between two relations is used to maintain consistency among the tuples in the two relations.
Figure 2.1: Data Graph of COMPANY Database
Figure 2.2: Referential Integrity Constraint on COMPANY schema
Consider the user query "find the list of employees working on project number 20, and the employees working under them on project number 10 or 20". In graph pattern matching, a query issued by the user is first converted into a pattern graph. The graphical representation of the user query is known as the pattern query. Hence, this user query is represented in the form of a pattern graph as shown in Figure 2.3. The objective of graph pattern matching techniques is then to find all the candidate matches of the pattern query in the data graph. The pattern query graph is imposed on the data graph to find the candidate matches; in this process either structural or semantic matches are identified, so multiple candidate matches are possible for a query pattern graph. A structural pattern matching approach tries to find the matches according to the graph structure of the pattern and the data graph, while semantic based matching is defined over the graph semantics. One of the resultant matches for the pattern query (Figure 2.3) in the data graph (Figure 2.1) is shown in Figure 2.4.
Figure 2.3: Pattern Query
Figure 2.4: Match from COMPANY Database
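As a rough worked example, the user query of this section can be evaluated directly over the Super_Ssn and WORKS_ON data of Table 2.1. The Python sketch below is only one possible reading of the query: it treats "working under" as the Super_Ssn relationship, and the variable names are assumptions. It is not the matching algorithm of the thesis, and its output is not claimed to coincide with Figure 2.4.

```python
# Edges derived from Table 2.1 (an illustrative encoding, not the storage format).
super_ssn = {12345: 33455, 33455: 88655, 99877: 98765, 98765: 88655,
             66844: 33455, 45345: 33455, 98798: 98765, 88655: None}
works_on = {(12345, 1), (12345, 2), (66844, 3), (45345, 1), (45345, 2),
            (33455, 2), (33455, 3), (33455, 10), (33455, 20), (99877, 30),
            (99877, 10), (98798, 10), (98798, 30), (98765, 30), (98765, 20),
            (88655, 20)}

def projects_of(ssn):
    """Set of project numbers an employee works on."""
    return {p for e, p in works_on if e == ssn}

# Pattern node u1: an employee working on project 20.
supervisors = {e for e, p in works_on if p == 20}

# Pattern edge (u1, u2): u2 is supervised by u1 and works on project 10 or 20.
matches = sorted((sup, emp) for emp, sup in super_ssn.items()
                 if sup in supervisors and projects_of(emp) & {10, 20})
```

Under this reading, `matches` contains the pairs (88655, 33455), (88655, 98765), (98765, 98798) and (98765, 99877), i.e. James supervises Franklin and Jennifer, and Jennifer supervises Ahmad and Alicia.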
2.3. Graph Pattern Matching Approaches
A diversity of problems build on graph matching, which is a complex processing task, and a number of approaches exist for the graph pattern matching problem. In this section we give an overview of these techniques. A number of graph pattern matching techniques have been developed for finding pattern matches despite the complex structure of graphs and the variety of their properties [49]. We classify the approaches based on graph structure and on semantic similarity metrics [7], as shown in Figure 2.5.
Figure 2.5: Classification of Graph Pattern Matching Approaches
2.3.1. Structure based approaches
All graphs share the basic structural elements, vertices and edges. This class contains the traditional graph matching methods that find matches by measuring the similarity of graphs based on structure [7], i.e. they focus on a specific structural property (e.g. a restriction on the number of edges or the number of isomorphs) and measure the similarity of the graphs based on subgraph isomorphism (maximum common subgraph) or simulation.
Subgraph Isomorphism: This class of methods is used to find exact graph matches. A graph isomorphism from a graph G to a graph G' is a bijective mapping from the nodes of G to the nodes of G' that preserves edges in both directions. Similarly, a subgraph isomorphism from G to G' is an isomorphism from G to a subgraph of G'. Hence the subgraph isomorphism approach to graph pattern matching either finds all the subgraphs of the data graph G that are isomorphic to the given query pattern graph P, or returns a subgraph match which is isomorphic to the pattern with the best matched nodes, based on various quality models [18]. Another important concept is the maximum common subgraph. A maximum common subgraph of two graphs G and G' is a graph G'' that is a subgraph of both G and G' and has the maximum number of nodes among all common subgraphs of G and G'. A large data graph leads to a huge number of isomorphs, so the calculation of isomorphs is complex and expensive [7]. Graph pattern matching problems defined on subgraph isomorphism are essentially NP-complete [18]. Subgraph isomorphism is often used in application domains such as chemistry or bioinformatics, in which a candidate match with an identical structure or design is chosen. However, the complexity of the isomorphism based approach makes it infeasible to determine the candidate matches in "big" data graphs such as social graphs. In fact, for Web and social network analyses, these techniques struggle to find "inexact" matches [7], because in these scenarios the resultant subgraphs do not have structures identical to the query patterns. An algorithm based on homomorphism [25] has also been developed, which finds inexact matches in the data graph.
Graph Simulation: Graph simulation defines a binary relation between the nodes of the pattern graph and the nodes of the data graph that identifies the matches of the pattern in the data graph. A graph G matches a pattern P via graph simulation if there exists a binary relation S ⊆ VP × V, where VP and V are the sets of nodes in P and G, respectively, such that (1) for each pair (u,v) ∊ S, u and v have the same label; (2) for each node u in P, there exists a node v in G such that (u,v) ∊ S; and (3) for each pair (u,v) ∊ S and each edge (u,u’) in P, there exists an edge (v,v’) in G such that (u’,v’) ∊ S. Graph simulation can be determined in quadratic time [30][50].
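The definition of graph simulation can be made concrete with a small Python sketch. It computes the maximum simulation relation by naive fixpoint refinement, starting from label-equal candidates and discarding nodes that violate the child condition. This is an illustrative assumption, not the quadratic-time algorithm of [30][50], and the function and variable names are hypothetical.

```python
def graph_simulation(p_nodes, p_edges, g_nodes, g_edges):
    """p_nodes / g_nodes: dict node -> label; p_edges / g_edges: (src, dst) pairs.
    Returns the maximum simulation as dict pattern node -> set of data nodes,
    or None if some pattern node has no match."""
    succ = {}
    for v, w in g_edges:
        succ.setdefault(v, set()).add(w)
    # Candidates: data nodes carrying the same label as the pattern node.
    sim = {u: {v for v, lab in g_nodes.items() if lab == p_nodes[u]}
           for u in p_nodes}
    changed = True
    while changed:
        changed = False
        for u, u2 in p_edges:
            # Keep v only if some successor of v still matches u2.
            keep = {v for v in sim[u] if succ.get(v, set()) & sim[u2]}
            if keep != sim[u]:
                sim[u] = keep
                changed = True
    return sim if all(sim.values()) else None

# Toy example (assumed data): pattern A -> B.
p_nodes = {"u1": "A", "u2": "B"}
p_edges = [("u1", "u2")]
g_nodes = {"a1": "A", "a2": "A", "b1": "B", "b2": "B"}
g_edges = {("a1", "b1")}
relation = graph_simulation(p_nodes, p_edges, g_nodes, g_edges)
```

In the toy example, a2 is dropped because it has no B-labeled child, whereas b2 still matches u2 because graph simulation imposes no condition on parents.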
Many application areas need to find connectivity between pairs of nodes with a constraint on the number of hops in between, or via a path of arbitrary length. Subgraph isomorphism and graph simulation strictly apply edge-to-edge mapping in a data graph and hence impose topological constraints that are too strict. As a result, they may fail to capture the real connectivity among nodes. For these reasons, the traditional notions used for solving the graph pattern matching problem have been revised to identify practical and realistic matches in real-life graphs accurately and efficiently. Semantic based graph pattern matching methods, discussed in the next section, are able to overcome these underlying issues of structure based matching methods.
2.3.2. Semantic based approaches
Semantic matching approaches attempt to find semantically or conceptually similar graphs, i.e., candidate matches are based on the semantics of nodes and edges. This class of pattern matching techniques improves on the traditional notion of graph pattern matching by determining the maximum bounded simulation relation rather than a function, and by allowing edge-to-path mappings of various bounds between the pattern and the data graph [30][34]. Edge-to-path mapping with various bounds helps a semantic based graph pattern matching technique to reduce the matching complexity and to express patterns in emerging applications that the traditional notions of graph matching fail to identify.
Bounded Simulation: Bounded simulation is one way of redefining graph pattern matching relative to the traditional concept of pattern matching [34]. Traditional approaches use edge-to-edge mapping in the process of finding candidate pattern matches, while bounded simulation allows edge-to-path mapping. It also overcomes the limitation of label equality in pattern matching, since it supports search conditions beyond label equality. In the pattern graph, a search condition is specified on each node and a bound is attached to each edge. The connectivity of a pair of nodes in a pattern query graph is denoted by a constant label k or by *, which indicates that the path is bounded within k hops or unbounded, respectively. Bounded simulation imposes a weaker topological constraint [34] on the evaluation of the maximum bounded simulation relation. Bounded simulation is an enhanced version of the graph simulation approach. It has been observed in various applications that bounded simulation based pattern matching generates more meaningful candidate matches than structure based matching. Computing the bounded simulation relation, which generates the exact matches in the data graph for a pattern graph, takes cubic time, in O(|V||E| + |EP||V|² + |VP||V|). It improves on the traditional notions in the following respects [30][34][36]:
a) It finds a maximum bounded simulation relation rather than a function.
b) It maps an edge of the pattern P to paths of diverse bounds in the data graph G.
c) In contrast to the bijective function defined in subgraph isomorphism, it finds a binary relation, similar to graph simulation. This relation is defined between the nodes of P and G.
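The edge-to-path idea can be sketched in Python as follows. The code is a naive fixpoint refinement under assumed data structures (predicates as Python callables, bounds as integers or None for *). It is meant only to illustrate bounded simulation, not to reproduce the cubic-time algorithm of [34].

```python
from collections import deque

def reach_within(succ, v, k):
    """Nodes reachable from v by a nonempty path of at most k hops (k=None: any length)."""
    seen, frontier = set(), deque((w, 1) for w in succ.get(v, ()))
    while frontier:
        w, d = frontier.popleft()
        if w in seen or (k is not None and d > k):
            continue
        seen.add(w)
        frontier.extend((x, d + 1) for x in succ.get(w, ()))
    return seen

def bounded_simulation(p_pred, p_edges, g_attrs, g_edges):
    """p_pred: pattern node -> predicate over a node's attribute dict;
    p_edges: dict (u, u2) -> hop bound k, or None for '*'.
    Returns dict pattern node -> set of data nodes, or None if there is no match."""
    succ = {}
    for v, w in g_edges:
        succ.setdefault(v, set()).add(w)
    sim = {u: {v for v, a in g_attrs.items() if pred(a)} for u, pred in p_pred.items()}
    changed = True
    while changed:
        changed = False
        for (u, u2), k in p_edges.items():
            # Keep v only if a match of u2 lies within the bound of the edge (u, u2).
            keep = {v for v in sim[u] if reach_within(succ, v, k) & sim[u2]}
            if keep != sim[u]:
                sim[u], changed = keep, True
    return sim if all(sim.values()) else None

# Toy example (assumed data): a reaches b in 2 hops, a2 reaches b2 in 3 hops.
g_attrs = {n: {"label": l} for n, l in
           [("a", "A"), ("x", "X"), ("b", "B"), ("a2", "A"), ("x2", "X"), ("y", "X"), ("b2", "B")]}
g_edges = {("a", "x"), ("x", "b"), ("a2", "x2"), ("x2", "y"), ("y", "b2")}
p_pred = {"u1": lambda a: a["label"] == "A", "u2": lambda a: a["label"] == "B"}
within_two = bounded_simulation(p_pred, {("u1", "u2"): 2}, g_attrs, g_edges)
unbounded = bounded_simulation(p_pred, {("u1", "u2"): None}, g_attrs, g_edges)
```

With the bound k = 2, only a matches u1; with the unbounded label *, both a and a2 match.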
A data graph G = (V, E, ƒA) matches a pattern graph P = (VP, EP, ƒv, ƒe) via bounded simulation [10], denoted by P ⊴Bsim G, if there exists a binary relation S ⊆ VP × V such that a) for each node u in VP, there exists a node v in V such that (u,v) ∊ S; and b) for each pair (u,v) ∊ S and each edge (u,u’) in EP, there exists a nonempty path p from v to v’ in G such that (u’,v’) ∊ S, and len(p) ≤ k if ƒe(u,u’) = k.
Incremental Approach: In emerging applications, data graphs are large and dynamic in nature; updates to the data graph are frequent because operational data is frequently updated (insertions and deletions). Hence pattern matching may have to recompute the candidate matches from scratch whenever the data graph is updated. This limitation is overcome by the incremental graph pattern matching approach, which tries to find the changes ΔM to the previously found matches M in response to the changes ΔG in the data graph, as opposed to batch algorithms that recompute the matches starting from scratch [30]. Computing the changes minimizes match re-computation, as a small change in the data graph typically leads to a small change in the matches. Hence, computing the change is cheaper than recomputing all the matches for the data graph. In the incremental approach, the first step is to compute the matches on the entire data graph using a batch matching algorithm, and the next step is to incrementally identify new matches in response to the changes ΔG in the data graph [34]. The data graph G, a pattern query graph P, the resultant match M(P,G) computed by the batch algorithm, and the changes ΔG to the data graph are provided as input to the incremental algorithm. Its aim is to find the changes ΔM to the previously computed match such that M(P,G⊕ΔG) = M(P,G)⊕ΔM, where the operator ⊕ applies the changes to the data graph and to the match, respectively [34]. The data structure used to represent the changes ΔM in the resultant match is defined in terms of the area that is affected when the data graph is updated.
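The identity M(P,G⊕ΔG) = M(P,G)⊕ΔM can be illustrated with a deliberately simplified Python sketch. It assumes plain graph simulation rather than bounded simulation, and it handles edge deletions only: a deletion can only remove matches, so refinement may restart from the old match instead of from all label-equal candidates. The function names are hypothetical and this is not the incremental algorithm of [34].

```python
def refine(sim, p_edges, g_edges):
    """Shrink sim (pattern node -> set of data nodes) until the child condition holds."""
    succ = {}
    for v, w in g_edges:
        succ.setdefault(v, set()).add(w)
    sim = {u: set(vs) for u, vs in sim.items()}
    changed = True
    while changed:
        changed = False
        for u, u2 in p_edges:
            keep = {v for v in sim[u] if succ.get(v, set()) & sim[u2]}
            if keep != sim[u]:
                sim[u], changed = keep, True
    return sim

def batch_match(p_nodes, p_edges, g_nodes, g_edges):
    """Batch algorithm: start from all label-equal candidates."""
    start = {u: {v for v, lab in g_nodes.items() if lab == p_nodes[u]} for u in p_nodes}
    return refine(start, p_edges, g_edges)

def incremental_delete(old_match, p_edges, g_edges, deleted):
    """Edge deletions only remove matches, so refinement restarts from old_match."""
    new_match = refine(old_match, p_edges, g_edges - deleted)
    delta = {u: old_match[u] - new_match[u] for u in old_match}
    return new_match, delta

# Toy example (assumed data): pattern A -> B -> C.
p_nodes = {"u1": "A", "u2": "B", "u3": "C"}
p_edges = [("u1", "u2"), ("u2", "u3")]
g_nodes = {"a1": "A", "a2": "A", "b1": "B", "b2": "B", "c1": "C"}
g_edges = {("a1", "b1"), ("b1", "c1"), ("a2", "b2"), ("b2", "c1")}
M = batch_match(p_nodes, p_edges, g_nodes, g_edges)
deleted = {("b2", "c1")}
M_new, delta_M = incremental_delete(M, p_edges, g_edges, deleted)
```

Here `delta_M` plays the role of ΔM: deleting the edge (b2, c1) removes b2 as a match of u2 and, as a consequence, a2 as a match of u1, while the rest of M is reused unchanged.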
Let the resultant matches found in G and G⊕ΔG be denoted by M and M’, respectively. The area of G that is affected is denoted by AFF, and it can be defined as the difference between M and M’ in terms of nodes and edges. When the data graph is updated, the technique evaluates the new matches by using the previous computation of matches. Hence the cost of these approaches is not defined as a function of the input size, as it is in the traditional complexity analysis of batch algorithms. Instead, the algorithmic complexity must be analyzed in terms of |CHANGE|, which indicates the degree of change in the input data graph and in the output match and is defined as |ΔG| + |ΔM|. If the cost of an incremental algorithm can be expressed as a function of |CHANGE| rather than of the size of the input, the algorithm is said to be bounded. A complexity analysis in terms of |CHANGE| does not account for the amount of information that must be maintained by an incremental algorithm. As a result, boundedness of an incremental problem is achieved only in special and ideal cases. In [34] it has been shown that if the worst-case time complexity of an incremental algorithm is bounded by a function of |G|, |P| and |AFF|, then the algorithm is semi-bounded. Hence the complexity of an incremental algorithm must be described in terms of the size of the affected area of the data graph, rather than the size of the entire input data graph [35][36]. Incremental bounded simulation is unbounded when applied to unit updates, but it is semi-bounded when applied to batch updates, where its complexity is in O(|ΔG|(|P||AFF| + |AFF|²)).
Strong Simulation: Traditional graph pattern matching approaches impose topological constraints on data graphs to find relevant candidate matches for a pattern graph [7][18][20][34]. The technique in [30] faces a trade-off between low complexity and the capability to retain the topology in the matches. As a result, multiple candidate matches of the pattern P are generated whose structure differs from that of P. Strong simulation overcomes this limitation by imposing two conditions on pattern matching [9][36]: first, duality, to preserve the upward edge-to-edge mappings, and second, data locality, to reduce the number of excessive candidate matches.
Duality and locality help to capture the matched topology in the data graph and thus enhance the quality of the candidate matches by eliminating excessive matches. Strong simulation uses duality to capture the graph topology and to preserve the parent relationship, similar to bounded simulation [34]. Strong simulation has cubic time complexity in capturing the candidate matches, equivalent to that of bounded simulation. Strong simulation conserves the parent-child relationships among the nodes of the query pattern graph in its matches while preserving labels. In [50] simulation was proposed for the analysis of programs. To capture graph topology more strictly, the concept of duality is enforced in simulation.
Dual Simulation: [31] Dual simulation is a specific variant of strong simulation. Dual simulation preserves both parent and child relationships. A data graph G matches a query pattern graph P via dual simulation, denoted by P ⊴D G, if a) P ⊴sim G with a binary match relation S ⊆ VP × V, and b) for each pair (u,v) ∊ S and each edge (u2,u) in EP, there exists an edge (v2,v) in E such that (u2,v2) ∊ S. Both strong and dual simulation are based on the concept of data locality in the data graph with respect to the pattern query.
Locality: A class of graph pattern queries P is said to have data locality if, for any graph G and any node v in G, one can decide whether v is a match of a query node u in P locally, by inspecting only those nodes of G that are within d hops from or to v, where d is determined only by |P|. To define locality, the following notations are required.
Match graphs: [31] Consider a relation S ⊆ VP × V. The match graph with respect to S is a subgraph GS[VS,ES] of G, in which (1) a node v ∊ VS if and only if it is in S, and (2) an edge (v,v’) ∊ ES if and only if there exists an edge (u,u’) in P with (u,v) ∊ S and (u’,v’) ∊ S.
Strong simulation: A data graph G matches a simple pattern P via strong simulation, denoted by P ⊴Ssim G, if a) there is a node v in G such that P ⊴D Ĝ[v, dP], and b) the match graph Gs with respect to S is a subgraph of Ĝ[v, dP], where dP is the diameter of P and S is the maximum match for P ⊴D Ĝ[v, dP]. We refer to Gs as a match in G for P. It has been shown in [31] that strong simulation preserves the following topological structures between the data graph and the pattern graph.
Child relationship. If a match is found for a node v of data graph G to node u in pattern graph P, then each child of u in P must match a child of v in G.
Parents. If a match is found for a node v of data graph G to node u in pattern graph P, then each parent of u in P also matches a parent of v in G.
Connectivity. A graph is said to be connected if for each pair of nodes in the graph, there exists an undirected path connecting them. If P is connected, then so are matches of P in G.
Cycles. An undirected (resp. directed) cycle in P must match an undirected (resp. directed) cycle in G.
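The dual simulation conditions and the ball Ĝ[v, dP] used by strong simulation can be sketched in Python as below. The sketch rests on assumptions: it uses naive fixpoint refinement, it builds each ball from undirected hop distance, and it only reports the dual-simulation relation found inside each ball. It does not extract or deduplicate match graphs as a full strong simulation algorithm [31] would, and all names are hypothetical.

```python
def dual_simulation(p_nodes, p_edges, g_nodes, g_edges):
    """Maximum dual simulation: the child and the parent condition must both hold."""
    succ, pred = {}, {}
    for v, w in g_edges:
        succ.setdefault(v, set()).add(w)
        pred.setdefault(w, set()).add(v)
    sim = {u: {v for v, lab in g_nodes.items() if lab == p_nodes[u]} for u in p_nodes}
    changed = True
    while changed:
        changed = False
        for u, u2 in p_edges:
            keep = {v for v in sim[u] if succ.get(v, set()) & sim[u2]}    # child condition
            if keep != sim[u]:
                sim[u], changed = keep, True
            keep2 = {w for w in sim[u2] if pred.get(w, set()) & sim[u]}   # parent condition
            if keep2 != sim[u2]:
                sim[u2], changed = keep2, True
    return sim if all(sim.values()) else None

def ball(g_nodes, g_edges, center, radius):
    """Subgraph induced by the nodes within `radius` undirected hops of `center`."""
    nbr = {}
    for v, w in g_edges:
        nbr.setdefault(v, set()).add(w)
        nbr.setdefault(w, set()).add(v)
    seen, frontier = {center}, {center}
    for _ in range(radius):
        frontier = {w for v in frontier for w in nbr.get(v, ())} - seen
        seen |= frontier
    return ({v: g_nodes[v] for v in seen},
            {(v, w) for v, w in g_edges if v in seen and w in seen})

def strong_simulation(p_nodes, p_edges, g_nodes, g_edges, d_p):
    """For each ball center, the dual-simulation relation inside that ball (if any)."""
    result = {}
    for v in g_nodes:
        b_nodes, b_edges = ball(g_nodes, g_edges, v, d_p)
        sim = dual_simulation(p_nodes, p_edges, b_nodes, b_edges)
        if sim is not None:
            result[v] = sim
    return result

# Toy example (assumed data): pattern A -> B with diameter 1.
p_nodes, p_edges = {"u1": "A", "u2": "B"}, [("u1", "u2")]
g_nodes = {"a1": "A", "b1": "B", "c1": "C", "b2": "B"}
g_edges = {("a1", "b1"), ("c1", "b2")}
dual = dual_simulation(p_nodes, p_edges, g_nodes, g_edges)
strong = strong_simulation(p_nodes, p_edges, g_nodes, g_edges, 1)
```

In the toy example, b2 is rejected by the parent condition because its only parent is labeled C, and only the balls centered at a1 and b1 contain a match.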
2.3.3. View based approach
Views are used efficiently for querying relational and semi-structured data. View based pattern matching is an important and relevant notion for query processing in data integration systems, data mining and warehousing systems, etc. [38][40]. When the data graph is accessed via a set of views, pattern containment is used for answering a query pattern. It states that a query mapped into a pattern query graph can be answered by using a set of defined views if the pattern query graph is contained in that set of views [41]. When using views in graph pattern matching for a pattern P, a set of views 𝒱 = {V1, …, Vn} has to be defined on the data graph G. Indeed, if materialized pattern views are used to compute the answer to a graph pattern query, only the views in the set need to be visited, without accessing the original data graph, and the views can easily be merged to answer the graph pattern query. Traditionally, there are two ways to solve the problem of query answering using views: query answering and query rewriting. Query answering computes the query result by evaluating an equivalent query which refers only to the views present in the view set, while query rewriting reformulates the user query into another equivalent query which refers only to the views present in the view set. The notion that relates a user query to the view set used to answer it is referred to as containment, and for a graph pattern query as pattern containment. Pattern containment is the necessary and sufficient condition for finding the answer to a pattern by using a set of views.
Pattern Containment: [5] Consider a pattern query PS = (VP, EP, ƒv) and a set 𝒱 = {V1, …, Vn} of view definitions, where Vi = (Vi, Ei, ƒi). We say that PS is contained in 𝒱, denoted by PS ⊆ 𝒱, if there exists a mapping 𝜆 from EP to the powerset 𝒫(∪i∈[1,n] Ei) such that, for all data graphs G, the match set Se ⊆ ∪e′∈𝜆(e) Se′ for all edges e ∊ EP. Note that when PS ⊆ 𝒱, for all data graphs G, the match of PS can be computed efficiently by using the view extensions 𝒱(G) only, independent of |G|. The time complexity of computing PS(G) is O(|PS||𝒱(G)| + |𝒱(G)|²) if PS ⊆ 𝒱 [41]. Table 2.2 summarizes some of the fundamental works in this direction and highlights various aspects of these works.
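To illustrate how a contained pattern could be answered from view extensions alone, the Python sketch below assumes that the mapping 𝜆 is already given and that each view edge has a materialized set of matching data edges. It initializes each Se as the union of the mapped view-edge matches and then discards pairs that violate the simulation condition. The data layout, the view names V1 and V2 and the node names are hypothetical, and the sketch is not the algorithm of [41].

```python
def match_from_views(p_edges, lam, view_ext):
    """p_edges: list of pattern edges (u, u2);
    lam: pattern edge -> list of view-edge keys (the mapping lambda, assumed given);
    view_ext: view-edge key -> set of (v, v2) data-node pairs materialized for it.
    Returns pattern edge -> set of (v, v2) pairs, or None if some edge has no match."""
    # Start from the union of the matches of the view edges that cover each pattern edge.
    S = {e: set().union(*(view_ext[k] for k in lam[e])) for e in p_edges}
    changed = True
    while changed:
        changed = False
        for (u, u2) in p_edges:
            for (v, v2) in list(S[(u, u2)]):
                # v2 needs a match for every pattern edge leaving u2,
                # and v needs a match for every pattern edge leaving u.
                ok = (all(any(a == v2 for a, _ in S[e]) for e in p_edges if e[0] == u2)
                      and all(any(a == v for a, _ in S[e]) for e in p_edges if e[0] == u))
                if not ok:
                    S[(u, u2)].discard((v, v2))
                    changed = True
    return S if all(S.values()) else None

# Toy example (assumed data): pattern PM -> DBA -> PRG, covered by two single-edge views.
p_edges = [("PM", "DBA"), ("DBA", "PRG")]
lam = {("PM", "DBA"): [("V1", ("PM", "DBA"))],
       ("DBA", "PRG"): [("V2", ("DBA", "PRG"))]}
view_ext = {("V1", ("PM", "DBA")): {("Bob", "Mat"), ("Walt", "Fred")},
            ("V2", ("DBA", "PRG")): {("Mat", "Pat")}}
answer = match_from_views(p_edges, lam, view_ext)
```

The pair (Walt, Fred) is discarded because Fred has no PRG child in the extension of V2; the data graph itself is never consulted.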
Table 2.2 Summary of Graph Pattern Matching approaches

Research Topic, Author(s): Subgraph Isomorphism, J. R. Ullmann
Problem Discussed & Proposed Solution: Problem: Subgraph isomorphism. Solution: Finds the subgraphs isomorphic to the pattern, or returns a set of subgraphs isomorphic to the pattern with the best matched nodes.
Strengths: Uses data locality and finds exact matches. Preserves adjacency and data graph topology.
Limitation/Scope: Unable to find a node pair using an arbitrary length path. NP-complete and intractable. Search space increases exponentially with graph size. Matching is neither bounded nor semi-bounded. Imposes a strict graph topological constraint and is infeasible for finding candidate matches in large and dynamic data graphs.

Research Topic, Author(s): Graph Simulation
Problem Discussed & Proposed Solution: Problem: Matching constraint. Solution: Finds exact matches for the pattern graph.
Strengths: Edge to edge mapping. Bounded.
Limitation/Scope: Imposes a strict topological constraint. Does not possess data locality. Does not capture the graph topology of patterns in its matches.

Research Topic, Author(s): Bounded Simulation, W. Fan, J. Li, S. Ma, N. Tang, Y. Wu
Problem Discussed & Proposed Solution: Problem: Simply based on connectivity of a node pair connected via a path of arbitrary length. Solution: Each data graph node uses a pattern search condition; search is based on the contents or label of the node.
Strengths: Edge to path mapping. Bounded. Finds a maximum bounded simulation relation rather than bijective functions.
Limitation/Scope: Outperforms for cyclic patterns.

Research Topic, Author(s): Incremental Approach, W. Fan, X. Wang, Y. Wu
Problem Discussed & Proposed Solution: Problem: Finding matches when data graphs are frequently updated with small changes. Solution: Pattern search is based on the connectivity of a node pair within a predefined number of hops in the dynamic data graph with respect to the pattern graph.
Strengths: Semi-bounded. Reuses previous computations. Complexity is given in terms of the area affected by updates.
Limitation/Scope: Outperforms its counterparts in pattern matching on dynamic data graphs. Unbounded for matching based on subgraph isomorphism. Cyclic patterns are still a problem.

Research Topic, Author(s): Strong Simulation, S. Ma, Y. Cao, W. Fan, J. Huai, T. Wo
Problem Discussed & Proposed Solution: Problem: Capturing the topology of data graphs. Solution: Preserves the topology of the matched graphs, such as parent, connectivity and cycle, by enforcing duality and data locality on the data graph for the candidate pattern matches.
Strengths: Uses data locality. Finds a bounded number of matches. Due to data locality, effective in pattern matching on distributed systems.
Limitation/Scope: Unable to respond in large and dynamic data graphs.

Research Topic, Author(s): View based approach, W. Fan, X. Wang, Y. Wu
Problem Discussed & Proposed Solution: Problem: Querying semi-structured data and finding answers to queries using views without accessing the large social graphs. Solution: Pattern containment is introduced for finding the candidate matches for a pattern graph; pattern containment is determined according to the view sets. View sets are defined over frequent queries on the DB.
Strengths: Can be easily extended to edge-labeled graphs and queries. Readily combined with existing distributed, compression and incremental techniques for graphs.
Limitation/Scope: Deciding which view or view set to cache for result generation. View initiation is a problem. View maintenance is a problem.
2.4. Performance Statistics
In the following, some comparative observations on the various GPM approaches are discussed.
Subgraph isomorphism is often used in application domains in which identical candidate pattern matches are desired [7][12], for example chemistry [10] or bioinformatics [11]. However, the complexity of subgraph isomorphism makes it impractical for determining the candidate matches in "big" data graphs such as social graphs. This approach also struggles to determine "inexact" matches when used for Web [8] and social network analyses [9]. In fact, graph simulation and bounded simulation are used to classify Web sites [46] and to detect social positions [45]. Moreover, variants of bounded simulation in which topology is preserved, such as strong or dual simulation [31], may produce results that "approximate" isomorphic subgraphs. Graph simulation is a special case of bounded simulation that allows only simple patterns in which (a) mapping is done on the basis of a single attribute, i.e. the node label, and (b) only edge-to-edge mapping is allowed, i.e. all the edges are labeled with 1. In contrast to the NP-hardness of algorithms based on the notion of subgraph isomorphism, bounded simulation based graph pattern matching is a cubic-time notion. This is comparable to graph simulation, which takes quadratic time, because the pattern graph P is typically much smaller than the data graph G and the number of edges |E| is in O(|V|²) in the worst case. The complexity of an incremental graph pattern matching algorithm is characterized by the size of the changes in the input data graph and in the output match, the size of the graph pattern, and some auxiliary information computed in previous runs. Graph pattern matching using subgraph isomorphism, in contrast, is neither bounded nor semi-bounded. The incremental algorithms perform significantly better than their batch counterparts when the changes to a dynamic data graph are small.
In modern-day computing, dynamic and huge data graphs are a reality, so incremental pattern matching approaches are promising, as most real-life graphs are dynamic. The notion used for strong simulation can be adapted for bounded simulation on regular candidate pattern graphs. The performance of strong simulation is cubic-time rather than NP-complete, due to the notion of data locality. The cardinality of the resultant matches found using strong simulation is linearly related to the size of the data graph, and the resultant match for the pattern depends on the diameter of the query pattern graph. As opposed to graph simulation, strong simulation also captures the pattern's topology in the candidate matches. The pattern's topology is represented in the form of parents, connectivity and cycles with the help of the duality and locality imposed on candidate matches, while the complexity stays the same as for simulation. Unlike graph simulation, strong simulation benefits from locality, which makes pattern matching on distributed graphs more efficient. In subgraph isomorphism [18] and strong simulation [31], graph pattern matching follows the data locality of the data graph. Data locality makes distributed query evaluation easier, since only a bounded number of sites may have to be visited. View based GPM techniques can be readily adapted to data graphs and pattern queries with edge labels. As a matter of fact, an edge-labeled data graph can be changed to a node-labeled data graph by adding, for each edge of the data graph, a dummy node that carries the label of that edge, along with two unlabeled edges. For a data graph G = (V, E, ƒA) and a pattern graph P = (VP, EP, ƒv, ƒe), the algorithmic complexity of the graph pattern matching approaches along with their cardinality is shown in Table 2.3.
Table 2.3 Algorithmic Complexity of GPM approaches

Matching Approach      Complexity                                Cardinality
Subgraph Isomorphism   O((|V|² + |VP|)(|E| + |EP|))              O(|V|^|VP|)
Graph Simulation       O((|V| + |VP|)(|E| + |EP|))               O(|V||VP|)
Bounded Simulation     O(|V||E| + |EP||V|² + |VP||V|)            O(|V||VP|)
Incremental            O(|ΔG|(|P||AFF| + |AFF|²))                O(|V|)
Strong Simulation      O(|V|(|V| + (|VP| + |EP|)(|V| + |E|)))    O(|V|)
View Based             O(|PS||𝒱(G)| + |𝒱(G)|²)                   O(|V|)
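The conversion from an edge-labeled to a node-labeled data graph mentioned in this section can be sketched in Python as follows. The representation of labeled edges as (source, label, target) triples and the naming of the dummy nodes are assumptions made for illustration.

```python
def to_node_labeled(node_labels, labeled_edges):
    """labeled_edges: list of (v, label, w) triples.
    Returns (node_labels, edges) where each original edge is replaced by a
    dummy node carrying the edge label plus two unlabeled edges."""
    nodes = dict(node_labels)
    edges = set()
    for i, (v, label, w) in enumerate(labeled_edges):
        dummy = ("edge", i)          # hypothetical naming scheme for dummy nodes
        nodes[dummy] = label
        edges.add((v, dummy))
        edges.add((dummy, w))
    return nodes, edges

# Toy example (assumed data).
nodes, edges = to_node_labeled({"a": "Emp", "b": "Emp"}, [("a", "supervises", "b")])
```

After the conversion, the label "supervises" sits on the dummy node ("edge", 0), and the original edge from a to b becomes a two-hop path through that node.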
2.5. Challenges in Graph Pattern Matching
In recent years, researchers from various application domains have made many contributions to pattern matching. The pattern matching approaches are classified according to semantics and structure [7]. Each category has inherent challenges, such as indexing of candidate matches, candidate selection and identifying matches. In the following, we discuss some additional challenges for graph pattern matching.
(1) Data Volume, Velocity, Variety and Veracity: With the evolution of social media such as Facebook and Twitter, these data aspects have emerged as challenging forces. Data is generated at a tremendous rate and in huge amounts (zettabytes) by various computing devices, and the nature of the generated data is complex (structured, semi-structured and unstructured). Many real-world applications deal with such big data for processing and analytics, yet only a few globally accepted solutions have been designed. The efficiency of graph pattern matching approaches may be affected by the considerable size of real-life graphs and by their heterogeneity. Hence, it is important to develop algorithms that have low run-time complexity and also preserve the accuracy of the matching.
(2) Dynamic Graph: The graphs in real-life applications are dynamic in nature, i.e. they undergo constant change. For instance, in social graphs such as Facebook or Twitter, hundreds of links are created or destroyed simultaneously. Furthermore, the types of edges and the labels attached to nodes also change from time to time, i.e. real-life graphs are evolving. Therefore, it is not feasible to rerun the pattern matching algorithm from the start once changes are made. Hence, there is a need to develop graph pattern matching algorithms that incrementally compute new matches over evolving graphs with high accuracy.
(3). Query Access Pattern: In most traditional applications, user queries over a graph database are static and simple. In modern data processing, user queries are significantly more dynamic and complex, because the predicates in a query are complex and many data sources are relevant to a single user query.

(4). Index adjacency: From the implementation perspective, a graph database is stored using index adjacency. In index adjacency, the relationships among data items are stored in an index list, while the connected data graph nodes are physically stored in some other form of database (Relational/XML).

(5). View Selection and Maintenance: View based pattern matching approaches require relevant and up-to-date views for answering pattern queries. Selecting the relevant view or views is a complex task, and maintaining the data records in (materialized) views is a further challenge.

(6). Scaling/Upgrading an existing graph: In case of any expansion or extension, the entire data graph has to be realigned and the pattern queries need to be tuned accordingly.
Chapter 3 ANSWERING PATTERN QUERY USING INCREMENTAL VIEWS

The notion of an incremental view is closely related to the relational database view, which gives a way of portraying a view over the database. A database user issues a series of queries; from this series, the most frequently used queries are selected and stored in a pre-compiled form. These pre-established query commands are kept in the database dictionary, and the result sets of these queries are called views. A view definition captures the query semantics and is useful in semantic based pattern matching. Materializing these views is therefore one alternative in this direction, as the query results are materialized together with the definition to support both structure and semantic based pattern matching. In a materialized view, the result set is stored in memory in advance and can be retrieved from the cache whenever it is needed.

View based query processing has been addressed by means of two basic approaches: (1) query rewriting, which rewrites the user query into another equivalent query that refers only to materialized views as its data source, and (2) query answering, which aims at directly answering the query based on the view extensions. In this thesis, we explore pattern query answering using views. The problem of pattern query answering using a view set is to find another query that is equivalent to the user query and refers only to views, i.e. to find an efficient method of answering a query by using a materialized view set previously defined over the database, rather than accessing the whole database [38][40][41][51].
3.1. Graph Pattern Matching using Views
For answering graph pattern queries using views defined on a data graph, the notion of pattern containment is used. Pattern containment is a form of query containment which states that a match of a pattern query graph in the data graph can be found by using a set of views defined on the same data graph if and only if the query pattern graph definition is contained in the definitions of the views [41]. Pattern containment [41] is a property which defines the mapping of a query pattern on the data graph.

For a pattern query P and a set of views 𝒱 = {V1,…,Vn} defined on data graph G, the graph pattern matching problem defined on views is to find another query pattern graph Q such that Q is equivalent to P on G and Q refers only to the views present in V, without accessing G. If such a query Q exists, then the set 𝒱(G) has to be found, which contains only those views from set V that can compute the result of P.

Consider a set V of view definitions defined on G and a pattern query P whose match has to be found. We say that P is contained in V, denoted by P ⊆ V, if there exists a mapping λ from EP to the powerset 𝒫(⋃ i∈[1,n] Ei) such that, for all data graphs G, the match set Se ⊆ ⋃ e'∈λ(e) Se' for all edges e ∈ EP. When P ⊆ V, for all graphs G, the match M of query pattern P can be found by using the views V(G) only, independent of |G|. In [22], it is proved that pattern containment is a necessary and sufficient condition for deciding whether a pattern query can be answered by using views and, if so, which views should be selected. An algorithm is also provided there that finds the match M of P using V(G) in O(|P||V(G)| + |V(G)|²) time.

For answering graph pattern queries, maximum pattern containment is desired. Finding the optimal view set from which a pattern query can be answered, however, gives rise to the minimum containment problem. The problem of minimum pattern containment [41] is to find a minimum subset of views from the view set that contains the pattern query. Identifying the relevant views to answer a pattern query is a difficult task on a static data graph, and for large and dynamic data graphs it is a complex problem. A multi-step and non-linear process is required to identify the optimal views for answering a query in a large and dynamic data graph. One of the key challenges in using views for graph pattern matching is how to cope with the sheer size and the dynamic nature of real-life graphs.
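As an illustration of the minimum pattern containment idea, the following is a minimal sketch of a greedy view selection. It assumes a simplified, label-level reading in which a view "covers" a pattern edge when it has an edge with the same pair of labels; the function name `greedy_containment`, the view edge sets and the edge directions are hypothetical and are not taken from the figures of this thesis. A greedy choice does not guarantee a minimum subset; it only shows the shape of the problem.

```python
from typing import Dict, List, Optional, Set, Tuple

Edge = Tuple[str, str]  # (from_label, to_label)


def greedy_containment(pattern_edges: Set[Edge],
                       views: Dict[str, Set[Edge]]) -> Optional[List[str]]:
    """Pick views until every pattern edge is covered by some view edge.

    Returns None when the views cannot cover the pattern (not contained).
    """
    uncovered = set(pattern_edges)
    chosen: List[str] = []
    while uncovered:
        if not views:
            return None
        # Choose the view covering the most still-uncovered pattern edges;
        # ties are broken by view id order.
        best = max(sorted(views), key=lambda v: len(views[v] & uncovered))
        gain = views[best] & uncovered
        if not gain:
            return None  # some pattern edge is covered by no view
        chosen.append(best)
        uncovered -= gain
    return chosen


# Hypothetical label-level view definitions (assumed for illustration only).
views = {
    "V1": {("PM", "PRG"), ("PM", "DBA")},
    "V2": {("DBA", "TST"), ("PRG", "TST")},
    "V3": {("PM", "CRD"), ("CRD", "TST")},
    "V4": {("DBA", "PRG")},
}
pattern = {("PM", "DBA"), ("PM", "PRG"), ("DBA", "PRG")}

selected = greedy_containment(pattern, views)
print(selected)
```

With these assumed definitions the selection consists of two views, and a pattern edge that no view covers makes the function return `None`, which corresponds to the pattern not being contained in the view set.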
As real-life graphs are frequently updated, it is cost prohibitive to re-compute the matches for a pattern from scratch whenever the graph is updated [51]. Hence there is a need to find new matches for the pattern incrementally, reusing as much of the previous computation as possible. To conduct graph pattern matching by exploiting views, the following questions have to be answered first. (1) How do we decide whether a match of the query pattern graph can be found by using the views defined on the data graph? (2) If so, which views in the view set should we choose to answer the pattern query? (3) How do we efficiently compute the match for a pattern from the view set? (4) How do we find the updated match when the data graph changes frequently?

In this thesis, we propose some notions to answer these fundamental questions in answering pattern queries using views. The notions are related to some of the established concepts of graph pattern matching. We propose a notion based on an incremental approach which revises graph simulation and pattern containment.

(1). The traditional notion of query containment is extended to deal with views, known as pattern containment. According to it, the answer to a pattern query graph can be found by using a set of views defined over the database if and only if the pattern query is contained in the view set.

(2). To decide which views from the view set have to be used for answering the pattern query, the minimum containment problem is identified for pattern containment. The minimum pattern containment problem is to find a minimum subset of the defined view set such that this subset contains the pattern query.

(3). To find the match of a pattern query, the notion of graph simulation is revised. Instead of finding the match from the data graph, matches are found from the view set by mapping nodes and edges of the pattern to the view definitions.

(4). An incremental pattern matching approach is used to find new matches of the pattern from the updated data graph. It finds new matches for the updated data graph by finding the changes to the previously computed matches. Hence, once matches have been computed on the entire graph by a batch matching algorithm, this approach incrementally identifies changed or new matches in response to changes in the data graph.
3.2. Graph Pattern Matching using Incremental Views
In [34], an approach is discussed that incrementally finds matches in a dynamic environment. Its main drawback is that the algorithm fails to outperform batch algorithms when only a few edges are updated. It also underperforms in terms of space complexity, since every execution uses the actual data graph, which in real life is very large. To overcome this problem, we may use the notion of views. The algorithm provided in [38][40][41] uses views defined on the data graph to find matches without accessing the actual data graph. This approach is only applicable in a static environment where data graphs and pattern graphs are not frequently updated. Hence, these traditional notions have to be revised so as to handle real-life graphs. The incremental view integrates both notions: on an update of the data graph, the algorithm identifies the affected views and incrementally updates them.

Traditionally, a view definition cannot be updated once it is defined, so modifications to the database are not easily reflected. In this scenario, the database administrator can do only two things to reflect the changes in the database: add a new view definition or delete a previously defined view. Hence the concept of incremental views can be readily adopted in an environment where the database changes frequently. In the case of a graph database, the database is referred to as a data graph.

At present, graph databases are used in a wide range of fields. For example, graph pattern matching is fundamental to the analysis of social networks. Social networks are modeled as graphs, where a node of the graph denotes a person and an edge between nodes indicates some relationship between them, e.g., in LinkedIn, Twitter and Facebook. Graph pattern matching is used routinely to identify social communities and social positions. These social network graphs are frequently updated. It is cost prohibitive to compute the query result from scratch again and again when the data graph changes. This motivates the adoption of incremental view based algorithms for graph pattern matching.
3.3. Proposed Algorithm
We present an approach that uses views to incrementally find matches of a pattern in a dynamic environment, i.e. one where the data graph changes frequently. Every time the data graph is updated, the changes must be reflected in the views. Hence, as the name suggests, the views present in the view set have to be incrementally updated as soon as the data graph is updated. Figure 3.1 shows the flow chart of the approach, and the notations used in the flow chart are described in Table 3.1. The proposed approach is multi-step and non-linear, and works in four phases as described below.

(a). Initialization Phase: In this phase, the user query and the database are initialized as pattern query P and data graph G respectively. Some queries from the previous access pattern are chosen and materialized as view set V. Data graph G and pattern query P are taken as input for initializing view set V. Finally, the match set Se' is initialized for all the views in V. The match sets Se' represent the materialized views, i.e. they contain the matches in the data graph for each edge e of the views.

(b). Merging Phase: The work of this phase is to determine the minimum containment view set, i.e., to identify the minimum number of views V' from the initialized view set V that are required to answer a pattern query P. Here V' represents the minimum containment view set. Next, the distance matrix (DM) is computed for all the views in V'. The distance matrix DM represents the reachability of the nodes/labels stored in a view definition. In DM, the value 1 represents a direct link between two nodes in the view graph. Any update or deletion of an edge in data graph G may cause a specific value in DM to be updated to 0.

(c). Pattern Containment Phase: As the name suggests, the work of this phase is to check whether the minimum containment view set identified in the previous step fulfils the pattern containment property.
Since the pattern containment property is a necessary and sufficient condition for answering graph pattern queries using views, this phase checks whether the result of the pattern query can be found by using the view subset V'. In the first step, for each edge e in pattern query P a match is found in the views present in V' using the mapping function λ. The match set Se for every edge e in P is initialized with the tuples present in Se' according to its matched values in V'. Finally, the resultant subgraph M of data graph G for pattern query P on V' is initialized by merging all the evaluated match sets Se. The next step checks whether the match set of any edge is empty. If the match set of an edge is empty, the algorithm terminates, since the pattern containment property is not satisfied. Otherwise the algorithm proceeds to the next phase to check for any update or deletion in the data graph.
(d). Edge Deletion Phase: This phase of the algorithm is the incremental part of the approach. Its work is to find out whether the resultant match is affected by the deletion of an edge in the data graph, and then to find the new resultant match for the pattern accordingly. This phase runs only when an edge is deleted from the data graph. In the first step, it is checked whether an edge has been deleted from data graph G. If no edge has been deleted, the algorithm returns the resultant match M found in the previous step unchanged and terminates. Otherwise, the algorithm identifies all the views in V' that are affected by the edge deletion in data graph G. It then identifies the affected area in those affected views. Subsequently, the distance matrix DM and the match sets Se of the affected views are updated. Finally, the resultant match M is modified by merging the updated match sets Se. For a pattern query P, the edge deletion phase runs iteratively for each update or deletion on data graph G. Every time an edge is deleted from data graph G, the pattern containment condition has to be checked again, since the deletion of an edge may invalidate the minimum containment view set V'.
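The edge deletion phase described above can be sketched as follows. This is a minimal illustration under simplifying assumptions, not the implementation used in this thesis: match sets are kept per view and per label pair, a deleted edge only removes its own tuple (no further propagation of the deletion is modelled), and the names `delete_edge`, `label_in_views` and `view_match_sets` as well as the employee names are hypothetical.

```python
def delete_edge(edge_delete, label_in_views, view_match_sets, pattern):
    """Apply one edge deletion to the materialized views and re-merge M.

    edge_delete: (from_label, to_label, from_value, to_value) or None.
    Returns (affected view ids, resultant match M); M is None when some
    pattern edge is left without matches (containment no longer holds).
    """
    affected = set()
    if edge_delete is not None:
        from_label, to_label, from_value, to_value = edge_delete
        # Views whose definition has an edge with the deleted edge's labels.
        affected = {view for lv_from, lv_to, view in label_in_views
                    if (lv_from, lv_to) == (from_label, to_label)}
        for view in affected:
            matches = view_match_sets.get(view, {}).get((from_label, to_label), [])
            if (from_value, to_value) in matches:
                matches.remove((from_value, to_value))  # update Se' in place
    # Re-merge the match sets of all pattern edges into M.
    merged = set()
    for edge in pattern:
        s_e = [pair for view in sorted(view_match_sets)
               for pair in view_match_sets[view].get(edge, [])]
        if not s_e:
            return affected, None
        merged.update(s_e)
    return affected, merged


# Hypothetical data, assumed for illustration only.
label_in_views = [("PM", "DBA", "V1"), ("PM", "PRG", "V1"), ("DBA", "PRG", "V4")]
view_match_sets = {
    "V1": {("PM", "DBA"): [("Ann", "Bob")], ("PM", "PRG"): [("Ann", "Cat")]},
    "V4": {("DBA", "PRG"): [("Bob", "Cat"), ("Bob", "Dan")]},
}
pattern = [("PM", "DBA"), ("PM", "PRG"), ("DBA", "PRG")]

affected, match = delete_edge(("DBA", "PRG", "Bob", "Dan"),
                              label_in_views, view_match_sets, pattern)
print(affected, match)
```

Only the views returned in `affected` are touched; the data graph itself is never read. A full treatment would also update the distance matrix of each affected view.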
Figure 3.1: Proposed approach of Answering Pattern Query
Table 3.1: List of notations in flowchart (figure 3.1)
Symbols | Notations
G | Data graph
P | Graph Pattern Query
V = {V1,…Vn} | Set of view definitions
V' | Subset of V
Se' | Match set for edge e of views in V
𝜆 | Mapping Function
DM | Distance Matrix
Se | Match set for edge e in P
AFF | Affected area
M | Resultant match of P in V'
3.3.1. Example

The working of the proposed algorithm is explained using a company database of employee records, represented as the data graph depicted in figure 3.2(a). A portion of the network is represented as a data graph G, where each node denotes a person and each edge denotes a collaboration between two persons. The label attached to each node is the employee name and designation. The designations defined on nodes include project manager (PM), coordinator (CRD), programmer (PRG), database administrator (DBA) and tester (TST). The data graph is the structure that stores the attributes and data records of the operational database. The user query has to be mapped into a graph query, called the pattern query, as shown in figure 3.2(b). In this example the user query is to 'Find the list of employees who are PM, DBA, PRG in the Company'. The view set defined on the data graph is shown in figure 3.2(c).
Figure 3.2: Data Graph 'G', Pattern query 'P' and View set 'V'

In the initialization phase, the data graph 'G', pattern query 'P' and view set 'V' are initialized. The query access pattern is used to identify the queries that various users frequently issue on the database. The database profile and user profile are collectively used to derive this pattern. The view set 'V' initially consists of these frequent database queries. The frequent queries are then materialized into views, so that better processing can be achieved. The match set Se' for each edge 'e' of the views in V is initialized as in figure 3.3 (a). Se' consists of all the edge values from the data graph for each of the views. Hence, in the first run of the algorithm, the views of V are materialized in the form of Se'.

The next phase is the merging phase, which evaluates the distance matrix 'DM' for the minimum and relevant view set V' from V. Here the relevant view set V' contains views V1 and V4 of view set V. The distance matrix measures the reachability of the nodes in the views with respect to the pattern query. Its values are given by the paths between nodes and their lengths, i.e. the length of the path between two nodes is their distance in the matrix. The calculated distance matrix for each of the relevant views is shown in figure 3.3 (b).

Figure 3.3: Match set Se and distance matrix DM' on view set V'

As shown in the definition of view V1, nodes PRG and DBA are reachable from PM by a path of one unit distance, and similarly in view V4 nodes PRG and DBA are at distance one from each other. In the current approach, the distance matrix of each relevant or selected view is evaluated and used in the subsequent phase to determine the pattern containment. The distance matrix also enables the approach to detect possible updates on the data graph, since the values in DM are updated when an edge in the data graph is deleted or updated.

In the pattern containment phase of the algorithm, the views are identified from the pattern of the query according to a mapping function. Hence, there is a mapping function 𝜆 from EP of P to the edges present in set V'. Once the mapping is done, the corresponding match sets Se are initialized with the help of the match sets Se' of the matched views in V'. With these match sets Se, the resultant match M of the pattern can be found easily by merging the sets Se, without accessing the actual graph G, as shown in figure 3.4. Based on this concept, the effect of a change in data graph G can be detected on the views alone by observing their distance matrices DM. The distance matrices for V1 and V4 are given in figure 3.3(b).

Now, suppose an edge (Ben, Walt) is deleted from data graph G for some operational reason. Once an edge is deleted, the affected view has to be found, and the match sets Se' and distance matrix DM of that view have to be updated accordingly. Here, V4 is affected, since the label of the deleted edge matches the definition of V4. The updated match set Se' of V4 is shown in figure 3.5 (a). At the end, the updated match sets for all the edges in P are merged to form the resultant match M. Figure 3.5 (b) shows the resultant match M.
Figure 3.4: Resultant match 'M'
Figure 3.5: Updated match set Se' and resultant match M'
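The distance matrix used in the example can be illustrated with a short sketch. It assumes that DM stores the length of the shortest directed path between two nodes of a view graph and that 0 stands for "no path" (this is our reading of the description above, in which a value becomes 0 after a deletion). The function name `distance_matrix` and the edge directions of the sample view are hypothetical.

```python
from collections import deque


def distance_matrix(nodes, edges):
    """All-pairs shortest path lengths by BFS; 0 means 'no path' (or same node)."""
    adjacency = {node: [] for node in nodes}
    for source, target in edges:
        adjacency[source].append(target)
    dm = {u: {v: 0 for v in nodes} for u in nodes}
    for start in nodes:
        dist = {start: 0}
        queue = deque([start])
        while queue:
            current = queue.popleft()
            for nxt in adjacency[current]:
                if nxt not in dist:
                    dist[nxt] = dist[current] + 1
                    queue.append(nxt)
        for target, length in dist.items():
            dm[start][target] = length
    return dm


# Assumed definition of a view in which PRG and DBA are one step from PM.
view_nodes = ["PM", "PRG", "DBA"]
view_edges = [("PM", "PRG"), ("PM", "DBA")]

dm = distance_matrix(view_nodes, view_edges)
# Recompute after the link PM -> DBA disappears from the view graph.
dm_after = distance_matrix(view_nodes,
                           [e for e in view_edges if e != ("PM", "DBA")])
print(dm["PM"], dm_after["PM"])
```

Recomputing the whole matrix is the simplest choice for small view graphs; for larger views, only the rows that can reach the removed edge would need to be refreshed.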
Chapter 4 EXPERIMENTAL ANALYSIS
The algorithm for 'answering graph pattern query using incremental views' is referred to as Match and consists of four modules. The inputs of the Match algorithm are the data graph G, the pattern graph P and the set of view definitions VDefinition; the output is the maximum match M in VDefinition for pattern graph P. The steps for finding the intermediate solution of each module are described in detail below.

Phase 1: Initialization phase – The work of this phase is to initialize all data structures. Once the user query is provided as input, it is represented in the form P = (Pfrom, Pto), where (Pfrom, Pto) represents a directed edge from node id Pfrom to node id Pto. VDefinition = (VDfromLabel, VDtoLabel) stores the definitions of the views defined on the data graph in tabular form, where each row represents a directed edge from node label id VDfromLabel to node label id VDtoLabel. A match is found in G for every edge present in VDefinition and is stored in the match set Se' = (SfromValue, StoValue), where (SfromValue, StoValue) represents a directed edge from node value SfromValue to node value StoValue.

Input: Data graph G = (V,E,ƒA), Pattern query graph P = (VP,EP,ƒv,ƒe), set of view definitions VDefinition = (VDfromLabel, VDtoLabel)
Output: Materialized views as match set Se' = (SfromValue, StoValue)
1. for all edges e = (VDfromLabel, VDtoLabel) in VDefinition
2.     find match Se' from G
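The initialization phase can be sketched in a few lines. This is a simplified illustration that matches view edges to data graph edges purely by their pair of node labels; the view id column in `v_definition`, the function name `materialize_views` and the sample employees are assumptions made for the example.

```python
from collections import defaultdict


def materialize_views(labels, graph_edges, v_definition):
    """Build Se' for every view edge: all data graph edges with matching labels."""
    # Index data graph edges by their (from label, to label) pair.
    by_label = defaultdict(list)
    for source, target in graph_edges:
        by_label[(labels[source], labels[target])].append((source, target))
    match_sets = {}
    for view_id, from_label, to_label in v_definition:
        key = (view_id, from_label, to_label)
        match_sets[key] = list(by_label.get((from_label, to_label), []))
    return match_sets


# Hypothetical data graph: node value -> label, plus directed edges.
labels = {"Ann": "PM", "Bob": "DBA", "Cat": "PRG", "Dan": "PRG"}
graph_edges = [("Ann", "Bob"), ("Ann", "Cat"), ("Bob", "Cat"), ("Bob", "Dan")]

# Rows of the view definition table, extended with an assumed view id.
v_definition = [("V1", "PM", "DBA"), ("V1", "PM", "PRG"), ("V4", "DBA", "PRG")]

match_sets = materialize_views(labels, graph_edges, v_definition)
print(match_sets)
```

This is the only step that reads the data graph; the later phases work on `match_sets` alone.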
Phase 2: Merging phase – The table LabelInViews = (LVfrom, LVto, InView) is provided as input to this phase. This table specifies in which view id InView the directed edge from node label id LVfrom to node label id LVto is present. The work of this phase is to find the containment view set CView, line 1-4, which contains the ids of those views whose merging gives the pattern query result. Once the containment view set CView is found, the distance matrix DM is computed for all views present in the containment set, line 5-6.

Input: LabelInViews = (LVfrom, LVto, InView)
Output: Containment view set CView, Distance matrix DM
1. for all edges e in EP
2.     for all entries in LabelInViews
3.         if Pfrom = LVfrom & Pto = LVto
4.             CView = CView ∪ {InView}
5. for all views in CView
6.     compute distance matrix DM of the view
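Lines 1-4 of the merging phase amount to a lookup of every pattern edge in the LabelInViews table. The sketch below shows this lookup under the assumption that view ids are accumulated in a set; the function name `containment_view_set` and the table rows are hypothetical. It collects every view that offers a matching edge and does not attempt to minimize the result.

```python
def containment_view_set(pattern, label_in_views):
    """Collect the ids of all views that contain some edge of the pattern."""
    c_view = set()
    for p_from, p_to in pattern:
        for lv_from, lv_to, in_view in label_in_views:
            if (p_from, p_to) == (lv_from, lv_to):
                c_view.add(in_view)
    return c_view


# Assumed rows of LabelInViews: (from label, to label, view id).
label_in_views = [
    ("PM", "DBA", "V1"),
    ("PM", "PRG", "V1"),
    ("PM", "CRD", "V3"),
    ("CRD", "TST", "V3"),
    ("DBA", "PRG", "V4"),
]
pattern = [("PM", "DBA"), ("PM", "PRG"), ("DBA", "PRG")]

c_view = containment_view_set(pattern, label_in_views)
print(sorted(c_view))
```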
Phase 3: Pattern Containment phase – The work of this phase is to find the resultant match M for the pattern query from the match sets Se' of the views present in the containment view set CView. For every edge of the pattern query, the match values are found from Se' and the match set Se is initialized accordingly, line 1-6. At the end of this phase, the match sets Se of all edges e of the pattern graph are merged to form the resultant match M, line 7-8.

Input: User pattern query P = (Pfrom, Pto), match set Se' = (S'fromValue, S'toValue) computed in the first phase
Output: Resultant Match M
1. for all edges e in EP
2.     for all edges of views in CView
3.         find match Se for e from Se' for all views in CView
4.     if Se = Ф
5.         pattern is not contained in view set
6.         exit
7. for all e in EP
8.     M = M ∪ Se
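The pattern containment phase can be sketched as follows, again as a simplified illustration rather than the implementation evaluated in this chapter. The layout of `view_match_sets` (view id, then label pair, then list of value pairs), the function name `pattern_containment` and the sample values are assumptions.

```python
def pattern_containment(pattern, c_view, view_match_sets):
    """Derive Se for each pattern edge from Se' and merge them into M.

    Returns (Se per edge, M), or None when some pattern edge has no match,
    i.e. the pattern is not contained in the selected views.
    """
    s_e = {}
    for edge in pattern:
        matches = []
        for view_id in sorted(c_view):
            for pair in view_match_sets.get(view_id, {}).get(edge, []):
                if pair not in matches:
                    matches.append(pair)
        if not matches:
            return None
        s_e[edge] = matches
    merged = set()
    for matches in s_e.values():
        merged.update(matches)
    return s_e, merged


# Hypothetical materialized views: view id -> label pair -> matched value pairs.
view_match_sets = {
    "V1": {("PM", "DBA"): [("Ann", "Bob")], ("PM", "PRG"): [("Ann", "Cat")]},
    "V4": {("DBA", "PRG"): [("Bob", "Cat"), ("Bob", "Dan")]},
}
pattern = [("PM", "DBA"), ("PM", "PRG"), ("DBA", "PRG")]

result = pattern_containment(pattern, {"V1", "V4"}, view_match_sets)
print(result)
```

Returning `None` corresponds to the early exit of lines 4-6: as soon as one pattern edge has an empty match set, no resultant match is produced.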
Phase 4: Edge deletion phase – The work of this phase is to iteratively check for updates on the data graph. In case of an update, the information about the edge is provided to the algorithm as edgeDelete = (EDfromLabel, EDtoLabel, EDfromValue, EDtoValue), where the node label ids of the directed edge are represented by (EDfromLabel, EDtoLabel) and the values of the deleted edge are represented by (EDfromValue, EDtoValue). If no edge is deleted, the algorithm returns the match set found in the previous phase, line 1-2. If an edge is deleted, it first finds the id of the view affected by this deletion, denoted by AFview, line 3-5. The match set Se of that view is then updated accordingly, and once the updated match set is found the resultant match is updated as well, line 6-8. The distance matrix DM of the affected view AFview is also updated, line 9. At the end, the resultant match is found again by merging the match sets Se of all the edges of the pattern graph, line 10-14.

Input: Edge to be deleted edgeDelete = (EDfromLabel, EDtoLabel, EDfromValue, EDtoValue)
Output: Updated resultant match M
1. if edgeDelete = Ф
2.     return M
3. for all entries in LabelInViews
4.     if EDfromLabel = LVfrom & EDtoLabel = LVto
5.         AFview = InView
6. for all entries in Se of AFview
7.     if EDfromValue = SfromValue & EDtoValue = StoValue
8.         delete(SfromValue, StoValue)
9. update DM of AFview
10. clear M
11. for all e in EP
12.     M = M ∪ Se
13. return M
14. exit

4.1. Experimental Result
This section describes the factors that have an impact on the execution time of the algorithm. The factors explored are the size of the user pattern query, i.e. the number of edges present in the user pattern query graph, the number of edges deleted per execution of the algorithm, and the size of the containment view set affected by the deletion of edges. All experiments were run on a machine with the following hardware specifications: 3rd generation Intel® Core™ i5-3210M CPU, 4GB of DDR3 RAM. The following software was used for the implementation of the algorithm:
Neo4j: A graph database management system developed by Neo Technology. In Neo4j, everything is stored in the form of nodes and edges connecting these nodes. Labels and attributes can be attached to nodes, to edges, or to both. Neo4j uses its own graph query language, known as Cypher.

Eclipse: An integrated development environment used for developing Java applications, which may also be used for developing applications in other programming languages.

Neoclipse: A tool to view, edit and explore databases created in Neo4j. It visualizes the database in the form of a graph by using Java.

The data set used for the implementation is provided in the previous chapter. Figure 4.1 shows the graphical representation of our dataset/database in Neo4j.
Figure 4.1: Data Graph of Company database (drawn in Neo4j interface)

Our algorithm is implemented in Java. The implementation is based on the definitions of the views; hence the view definitions must be stored in a file so that the algorithm can easily find the containment view set. We have used an XML file to store the view definitions, and an XML parser was designed to read the data from the file. Views are created on some frequently asked queries. We have created four views on our data set. Figures 4.2 to 4.5 show the views created on the data graph of Figure 4.1, together with their definitions.
View 1: Stores all the PRG and DBA working under PM.
Figure 4.2: Materialized view 1

View 2: Stores all the PRG and TST interacting with DBA and TST must work under PRG.
Figure 4.3: Materialized view 2

View 3: Stores all the TST working under CRD who are working under PM.
Figure 4.4: Materialized view 3

View 4: Stores all the PRG and DBA who are interacting with each other.
Figure 4.5: Materialized view 4
Once the views are created, the algorithm is run to find the match for the user query by using the views. The following part of the experiment determines the effect of various factors on the runtime of the algorithm.
(a). Query Answering (time) vs Pattern query 'P' Size: In answering a graph query, the pattern size is an important parameter, as the size of a pattern query is related to the overall response time of the algorithm. Query answering time is the time elapsed between query submission and the first response from the system. The X-axis of the plot represents the pattern query size, i.e. the number of edges present in the pattern query graph, and the Y-axis represents the execution time of the algorithm in milliseconds. The overall response time of the algorithm is shown for both scenarios, with and without the deletion of an edge.

[Plot: X-axis: Pattern query size (1 to 5); Y-axis: execution time in milliseconds (0 to 10000); series: 'When edge not deleted', 'When edge deleted']
Figure 4.6: Effect of Pattern Query 'P' size on Execution time
(b). Query Answering (time) vs Number of edges deleted: The frequency of edge deletions in the data graph has a major impact on the execution of the algorithm. It also affects the size of the containment view set, since the views affected by the deletion of an edge have to be found as well. In Figure 4.7, the X-axis of the plot represents the number of edges deleted in one phase of the algorithm and the Y-axis represents the execution time of the algorithm in milliseconds. When plotting the edge deletion frequency against the execution time of the algorithm, the pattern query remains the same for all values.

[Plot: X-axis: Number of edges deleted (1 to 5); Y-axis: execution time in milliseconds (0 to 35000); series: 'Execution time']
Figure 4.7: Effect of number of edges deleted on Execution time
Figure 4.8 shows the IDs of the views that are affected by the deletion of a particular edge.
Figure 4.8: List of affected views when edge deleted
Chapter 5 CONCLUSION
Finding a match of a user query in a data graph is a key problem in many data mining applications, which computationally search for patterns in data graphs or graph-structured data. Consequently, several efficient graph pattern matching techniques have been developed in the past few years. Together, these matching techniques represent decades of work by researchers from diverse research communities. Regardless of variations in the inherent properties of graphs, data sets, and algorithms, common premises have emerged. More or less, these traditional graph matching techniques are based on isomorphism and simulation and often fail to capture both the structural and the semantic similarity of data graphs for pattern searching. Data graphs of real-life applications are huge and dynamic in nature, which traditional graph pattern matching techniques are unable to cope with. This thesis briefly summarizes various approaches to graph pattern matching, irrespective of the application area, presents a revised notion of simulation, and develops a novel graph matching approach based on incremental (materialized) views.

To begin with, the notion of incremental graph matching is revised. The incremental approach is designed for graph pattern matching based on the notion of bounded simulation, which finds the maximum bounded simulation relation and maps edges in a pattern to paths of various bounds. The key point of this simulation strategy is to find inexact matches. This notion is based on the fact that an incremental algorithm works efficiently for pattern matching in a frequently updated data graph (updated by a sequence of edge deletions). The algorithm determines the matches for the query pattern in the data graph and maintains these candidate matches when the data graph is updated, instead of re-computing all the matches computed initially. In this thesis, the notion of incremental matching is settled as semi-bounded, and its complexity is independent of the size of the data graph.
After that, the notion of view based graph matching is revised for pattern matching according to derived pattern containment. Containment represents a set of views capable of deriving the candidate matches for a pattern query. For a user query and a 'set of views', the problem of graph pattern matching using views is to find another query that is equivalent to the user pattern query and refers only to views in the view set. If such a query exists, then the user pattern query can be answered using views without accessing the actual data graph.

To improve the approaches discussed above for pattern matching on dynamic data graphs, this thesis proposes an integration of two important notions for graph pattern matching: (1) incremental matching, which incrementally finds updated matches in a frequently updated data graph, and (2) view based matching, which works as a batch algorithm to find the matches of a user query from the view set without accessing the data graph. In this thesis, we have also identified some of the challenges and open issues in pattern matching on dynamic graphs, discussed next.
Listed Challenges and Open Issues

Research on answering graph pattern queries based on views is still evolving, with several open research challenges. Some of the key challenges in massive analytics are inherently data related, such as data size, heterogeneity, data quality and uncertainty. Based on the literature review in this thesis, some observations are recognized as potential research directions, briefly listed below:

(1). By combining more efficient batch and incremental matching with advanced features such as indexing, summarization and data compression techniques, a significant improvement on the lower bounds of traditional graph pattern matching techniques may be achieved.

(2). Many researchers have preferred an incremental graph pattern matching approach for evaluating patterns in DAGs, but similar notions have not yet been applied to cyclic patterns, where they would probably improve performance.

(3). Advanced ranking mechanisms can be adapted for ranking candidate matches in strong simulation. The ranked top-k matches are retrieved, and the same mechanism can further be adapted for regular expressions on edge types.
(4). Optimization techniques for incremental graph pattern matching need to be investigated, for instance by leveraging traditional approaches for batch algorithms.

(5). In view based pattern matching approaches, deciding which view or views are relevant and should be cached for answering frequently used pattern queries is a complex problem. Thus an efficient mechanism based on subgraph isomorphism in the view graphs can be designed.

(6). A key challenge is identifying the relevant views from the view set when answering graph pattern queries on a dynamic data graph. In a dynamic data graph, the relevant view set becomes outdated due to updates or deletions on the data graph. Similarly, an update on the data graph leads to a re-run of the algorithm to identify the affected view definitions and to re-evaluate the matches. Finally, there is the issue of handling deletions and updates in the data graph separately. A deletion in the data graph, when applying a view based graph pattern matching approach, involves updating the match sets of the views and their distance matrices, while updating the label of an edge or node in the data graph involves re-computation of the distance matrix. Both require identifying the affected views in the view set. These related challenges may largely be handled by combining the incremental view based approach with distributed and compression methods.

(7). All the existing view based graph pattern matching techniques are completely based on derived pattern containments in the data graph. Thus, an efficient algorithm for computing the maximal containment is required when containment is not present in the available view set.
REFERENCES
[1]. Jiawei Han, Micheline Kamber and Jian Pei, "Data Mining: Concepts and Techniques," Morgan Kaufmann, Elsevier, 2011.
[2]. Fayyad, Usama, Gregory Piatetsky-Shapiro, and Padhraic Smyth, "From data mining to knowledge discovery in databases," in AI magazine, 1996, vol. 17, no. 3, p. 37.
[3]. Brachman, Ronald J., and Tej Anand, "The process of knowledge discovery in databases," in Advances in knowledge discovery and data mining, American Association for Artificial Intelligence, 1996, pp. 37-57.
[4]. Angles, Renzo, and Claudio Gutierrez, "Survey of graph database models," in ACM Computing Surveys: CSUR, 2008, vol. 40, no. 1, p. 1.
[5]. Diane J. Cook, Lawrence B. Holder, "Mining Graph Data," John Wiley & Sons, 2007.
[6]. I. Robinson, J. Webber and E. Eifrem, "Graph Databases," O'Reilly, 2013.
[7]. Gallagher, Brian, "Matching structure and semantics: A survey on graph-based pattern matching," in Proceedings of the Conference of the American Association for Artificial Intelligence: AAAI FS 6, 2006, pp. 45-53.
[8]. Cho, Junghoo, Narayanan Shivakumar and Hector Garcia-Molina, "Finding replicated web collections," in ACM SIGMOD Record: ACM, 2000, vol. 29, no. 2, pp. 355-366.
[9]. W. Fan, "Graph Pattern Matching Revised for Social Network Analysis," in ACM 19th International Conference on Database Technology: ACM, 2012, pp. 8-21.
[10]. James, C. A., D. Weininger, and J. Delany, "Daylight theory manual daylight version 4.82. Daylight Chemical Information Systems," 2003.
[11]. Huan, Jun, Wei Wang, Deepak Bandyopadhyay, Jack Snoeyink, Jan Prins, and Alexander Tropsha, "Mining protein family specific residue packing patterns from protein structure graphs," in Proceedings of the eighth annual international conference on Research in computational molecular biology: ACM, 2004, pp. 308-315.
[12]. Yan, Xifeng, Philip S. Yu, and Jiawei Han, "Substructure similarity search in graph databases," in Proceedings of the 2005 ACM SIGMOD international conference on Management of data, pp. 766-777. ACM, 2005.
[13]. He, Huahai, and Ambuj K. Singh, "Closure-tree: An index structure for graph queries," in Proceedings of the 22nd International Conference on Data Engineering: ICDE IEEE, 2006, pp. 38-38.
[14]. Williams, David W., Jun Huan, and Wei Wang, "Graph database indexing using structured graph decomposition," in IEEE 23rd International Conference on Data Engineering: ICDE IEEE, 2007, pp. 976-985.
[15]. Miller, John A., Lakshmish Ramaswamy, Krys J. Kochut, and Arash Fard, "Research Directions for Big Data Graph Analytics," in IEEE International Congress on Big Data: BigData Congress IEEE, 2015, pp. 785-794.
[16]. Ma, Shuai, Jia Li, Chunming Hu, Xuelian Lin, and Jinpeng Huai, "Big graph search: challenges and techniques," Frontiers of Computer Science (2014): 1-12.
[17]. Conte, Donatello, Pasquale Foggia, Carlo Sansone, and Mario Vento, "Graph matching applications in pattern recognition and image processing," in Proceedings. 2003 International Conference on Image Processing: ICIP IEEE, 2003, vol. 2, pp. II-21.
[18]. J.R. Ullmann, "An Algorithm for Subgraph Isomorphism," in Journal of ACM, 1976, Vol. 23, No. 1, pp. 31-42.
[19]. De Nardo, Lorenzo, Francesco Ranzato and Francesco Tapparo, "The subgraph similarity problem," in IEEE Transactions on Knowledge & Data Engineering 5, 2008, pp. 748-749.
[20]. Bunke, Horst, "Graph matching: Theoretical foundations, algorithms, and applications," in Proceedings of Vision Interface, 2000, vol. 2000, pp. 82-88.
[21]. Chen, Li, Amarnath Gupta, and M. Erdem Kurul, "Stack-based algorithms for pattern matching on dags," in Proceedings of the 31st international conference on Very large data bases, VLDB Endowment, 2005, pp. 493-504.
[22]. Shasha, Dennis, Jason TL Wang, and Rosalba Giugno, "Algorithmic and applications of tree and graph searching," in Proceedings of the twenty-first ACM SIGMOD-SIGACT-SIGART symposium on Principles of database systems: ACM, 2002, pp. 39-52.
[23]. C. Martínez, G. Valiente, "An algorithm for graph pattern-matching," in Proceedings of Fourth South American Workshop on String Processing, 1997, vol. 8, pp. 180-197.
[24]. Fan, Wenfei, Philip Bohannon, "Information preserving XML schema embedding," in ACM Transactions on Database Systems: TODS, 2008, vol. 33, no. 1, p. 4.
[25]. Fan, Wenfei, Jianzhong Li, Shuai Ma, Hongzhi Wang, and Yinghui Wu, "Graph homomorphism revisited for graph matching," in Proceedings of the VLDB Endowment 3, 2010, no. 1-2, pp. 1161-1172.
[26]. X. Jeffery, J. Cheng, B. Ding, S. Philip and X. Wang, "Fast Graph Pattern Matching," in IEEE 24th International Conference on Data Engineering, 2008, pp. 913-922.
[27]. De Nardo, Lorenzo, Francesco Ranzato, and Francesco Tapparo, "The subgraph similarity problem," in IEEE Transactions on Knowledge & Data Engineering 5: TKDE IEEE, 2008, pp. 748-749.
[28]. T. Milo and D. Suciu, "Index structures for path expressions," in Database Theory—ICDT, Springer Berlin Heidelberg, 1999, pp. 277-295.
[29]. M.R. Henzinger, T. Henzinger and P. Kopke, "Computing Simulations on Finite and Infinite Graphs," in Foundations of Computer Science, 1995, pp. 453-462.
[30]. W. Fan, J. Li, S. Ma, N. Tang and Y. Wu, "Graph Pattern Matching: From Intractability to Polynomial Time," in Proceedings of the VLDB Endowment, 2010, vol. 3, no. 1.
[31]. Ma, Shuai, Yang Cao, Wenfei Fan, Jinpeng Huai and Tianyu Wo, "Strong simulation: Capturing topology in graph pattern matching," in ACM Transactions on Database Systems: TODS, 2014, vol. 39, no. 1, p. 4.
[32]. Tian, Yuanyuan, Jignesh M. Patel, "Tale: A tool for approximate large graph matching," in Data Engineering, ICDE IEEE 24th International Conference on: IEEE, 2008, pp. 963-972.
[33]. Fan, Wenfei, Jianzhong Li, Shuai Ma, Nan Tang and Yinghui Wu, "Adding regular expressions to graph reachability and pattern queries," in Data Engineering (ICDE) IEEE 27th International Conference on: IEEE, 2011, pp. 39-50.
[34]. Fan, Wenfei, Xin Wang, and Yinghui Wu, "Incremental graph pattern matching," in ACM Transactions on Database Systems: TODS, 2013, vol. 38, no. 3, p. 18.
[35]. Ramalingam, Ganesan and Thomas Reps, "An incremental algorithm for a generalization of the shortest-path problem," in Journal of Algorithms on the computational complexity of dynamic graph problems, 1996, vol. 21, no. 2, pp. 267-305.
[36]. Henzinger, Monika R., Thomas A. Henzinger, and Peter W. Kopke, "Computing simulations on finite and infinite graphs," in Foundations of Computer Science, Proceedings., 36th Annual Symposium on: IEEE, 1995, pp. 453-462.
[37]. Wang, Changliang and Lei Chen, "Continuous subgraph pattern search over graph streams," in Data Engineering, ICDE'09. IEEE 25th International Conference on: IEEE, 2009, pp. 393-404.
[38]. Halevy, Alon Y., "Theory of answering queries using views," in ACM SIGMOD Record 29, 2000, no. 4, pp. 40-47.
[39]. Wang, Haixun, Hao He, Jun Yang, Philip S. Yu and Jeffrey Xu Yu, "Dual labeling: Answering graph reachability queries in constant time," in Data Engineering, ICDE'06. Proceedings of the 22nd International Conference on: IEEE, 2006, pp. 75-75.
[40]. Halevy, Alon Y., "Answering queries using views: A survey," in The VLDB Journal 10, 2001, no. 4, pp. 270-294.
[41]. Fan, Wenfei, Xin Wang and Yinghui Wu, "Answering graph pattern queries using views," in IEEE 30th International Conference on Data Engineering: ICDE IEEE, 2014, pp. 184-195.
[42]. Fard, Arash, M. Usman Nisar, Lakshmish Ramaswamy, John A. Miller and Matthew Saltz, "A distributed vertex-centric approach for pattern matching in massive graphs," in International Conference on Big Data: IEEE, 2013, pp. 403-411.
[43]. Lenzerini, Maurizio, "Data integration: A theoretical perspective," in Proceedings of the twenty-first ACM SIGMOD-SIGACT-SIGART symposium on Principles of database systems: ACM, 2002, pp. 233-246.
[44]. Ma, Shuai, Yang Cao, Jinpeng Huai and Tianyu Wo, "Distributed graph pattern matching," in Proceedings of the 21st international conference on World Wide Web: ACM, 2012, pp. 949-958.
[45]. Venkataramani, Venkateshwaran, Zach Amsden, Nathan Bronson, George Cabrera III, Prasad Chakka, Peter Dimov, Hui Ding et al, "Tao: how facebook serves the social graph," in Proceedings of the 2012 ACM SIGMOD International Conference on Management of Data: ACM, 2012, pp. 791-792.
[46]. Zeng, Kai, Jiacheng Yang, Haixun Wang, Bin Shao, and Zhongyuan Wang, "A distributed graph engine for web scale RDF data," in Proceedings of the VLDB Endowment, 2013, vol. 6, no. 4, pp. 265-276.
[47]. Low, Yucheng, Danny Bickson, Joseph Gonzalez, Carlos Guestrin, Aapo Kyrola and Joseph M. Hellerstein, "Distributed GraphLab: a framework for machine learning and data mining in the cloud," in Proceedings of the VLDB Endowment 5, 2012, no. 8, pp. 716-727.
[48]. Malewicz, Grzegorz, Matthew H. Austern, Aart JC Bik, James C. Dehnert, Ilan Horn, Naty Leiser and Grzegorz Czajkowski, "Pregel: a system for large-scale graph processing," in Proceedings of the 2010 ACM SIGMOD International Conference on Management of data: ACM, 2010, pp. 135-146.
[49]. B. Bhargavi, K. P. Supreethi, "Graph Pattern Mining: A Survey of Issues And Approaches," in International Journal of Information Technology and Knowledge Management, 2012, vol. 5, no. 2, pp. 401-407.
[50]. R. Milner, "Communication and Concurrency," New York etc.: Prentice Hall, 1989, vol. 84.
[51]. Saltz, Matthew Wyatt, and J. A. Miller, "A fast algorithm for subgraph pattern matching on large labeled graphs," PhD diss., University of Georgia, 2013.
LIST OF PUBLICATIONS
1. Komal Singh and Vikram Singh, "Graph Pattern Matching: A brief survey of challenges and research directions," in Proceedings of the 10th INDIAcom, International Conference on Computing for Sustainable Global Development: INDIAcom, IEEE, 16-18 March 2016, New Delhi (Accepted)
2. Komal Singh and Vikram Singh, "Answering Graph Pattern Query using Incremental Views," in Proceedings of the International Conference on Computing, Communication and Automation: ICCCA, IEEE, 28-30 May 2016, Greater Noida (Accepted)