Trans. Tianjin Univ. 2015, 21: 278-283 DOI 10.1007/s12209-015-2460-6

Efficient Path Query and Reasoning Method Based on Rare Axis*

Jiang Yang(姜 洋)1, 2, Feng Zhiyong(冯志勇)1, Wang Xin(王 鑫)1, Ma Xiaoning(马晓宁)2

(1. School of Computer Science and Technology, Tianjin University, Tianjin 300072, China; 2. School of Computer Science and Technology, Civil Aviation University of China, Tianjin 300300, China)

© Tianjin University and Springer-Verlag Berlin Heidelberg 2015

Abstract: A new concept of rare axis based on statistical facts is proposed, and an evaluation algorithm is designed accordingly. For nested regular expressions containing rare axes, the proposed algorithm can reduce the evaluation complexity from polynomial time to nearly linear time. The distributed technique is also employed to construct the navigation axis indexes for resource description framework (RDF) graph data. Experimental results on DrugBank and BioGRID show that this method can improve the query efficiency significantly while ensuring accuracy, and can meet the query requirements on web-scale RDF graph data.

Keywords: graph; path; regular expression; complexity; distribution

The core idea of the semantic web is to build machine-understandable data networks by giving formal semantics to the data on the web[1]. As the data foundation of the semantic web, resource description framework (RDF) graph data have reached ten billion triples, driven by the linked data movement[2]. RDF is a special kind of graph data model: when ontology-layer semantics are expressed, the edges of an RDF graph can also act as nodes, i.e., the set of edge labels may have a nonempty intersection with the set of nodes, which allows a query to obtain more implicit information by inference[3]. As in traditional graph data models, path query has always been both a focus and a difficulty of research: on one hand, it can express paths of any length, especially navigational paths of unlimited length; on the other hand, its high evaluation complexity makes it difficult to meet the query requirements on large-scale RDF graph data[4, 5]. Nested regular expression (NRE)[6] is the latest path query expression with polynomial time complexity. It can achieve RDF schema (RDFS) semantic inference on the original RDF graph, and it is equivalent to property path under the existential semantics[7], i.e., the two kinds of expressions can be transformed into each other equivalently under the existential semantics[8]. Other path query expressions, by contrast, either do not involve RDFS semantic inference or achieve the reasoning by the closure-oriented method; representative implementation systems are Gleen[9], SPARQLeR[10], SPARQ2L[10] and PSPARQL[11]. Moreover, NRE has also been proved to have strong expressive power and polynomial time evaluation complexity[12].

However, with the explosive growth of RDF graph data, even polynomial time evaluation complexity cannot meet the path query requirements on web-scale RDF graph data[13, 14]. In order to further improve the query efficiency of NRE on large-scale RDF graph data, we first use distributed technology to build the navigation axis index of NRE on RDF graph data, and then compute statistics on the frequency of each NRE navigation axis in the index. Different navigation axes appear in the NRE index with different frequencies; when a low-frequency navigation axis appears in an NRE expression, the expression can be cut at this axis to reduce the search scope of the RDF graph, thereby reducing the time and space evaluation complexity of NRE. In this paper, we propose a new concept of rare axis and a new query algorithm for NRE based on rare axis. Furthermore, we carry out experiments on datasets from DrugBank and BioGRID, respectively.

Accepted date: 2014-08-18. *Supported by the National Natural Science Foundation of China(No. 61373035 and No. 61100049), National High Technology Research and Development Program of China (“863” Program, No. 2013AA013204), Fundamental Research Funds for the Central Universities(No. 3122014C018 and 3122015C022), and Scientific Research Funds Supported by Civil Aviation University of China(No. 09QD02X). Jiang Yang, born in 1984, male, Dr, lecturer. Correspondence to Wang Xin, E-mail: [email protected].

Jiang Yang et al: Efficient Path Query and Reasoning Method Based on Rare Axis

1 NRE distributed index scheme

NRE is a regular expression with branched structure, which defines seven kinds of RDF navigation axes for each moving step, as shown in Fig. 1. It can express all the RDFS inference rules and performs query and reasoning on the original RDF graph according to the user's requirements, which avoids calculating the closure of the RDF graph and gives polynomial time evaluation complexity. Its index structure is an RDF graph linked list with navigation axes, as shown in Fig. 2. To further reduce the evaluation complexity of NRE and make it meet the query requirements on large-scale RDF graph data, we use the MapReduce[15] framework to build the NRE index and to compute statistics on the frequency of each NRE navigation axis, as described in Algorithms 1 and 2.

Definition 1 NRE grammar

$$exp := axis \mid axis::a\ (a \in U) \mid axis::[exp] \mid exp/exp \mid (exp|exp) \mid exp^{*} \mid exp^{+}$$

where axis ∈ {self, next, next⁻¹, edge, edge⁻¹, node, node⁻¹}.

Fig. 1 Navigation axes in NRE

Fig. 2 NRE index structure for RDF graph

Algorithm 1 MapReduce algorithm for NRE index
Input: Triples of RDF graph
Output: Navigation axis index of NRE for RDF graph
Map([key], [value]):
    Read input triple(string s, string p, string o)
    key=s value=(next::p, o)
    key=s value=(edge::o, p)
    key=p value=(edge⁻¹::o, s)
    key=p value=(node::s, o)
    key=o value=(next⁻¹::p, s)
    key=o value=(node⁻¹::s, p)
    Output([key], [value])
Reduce(key, [value]):
    Read input (key, [value])
    For each value in [value]
        valuelist=valuelist+value
    value=valuelist+(self::key, key)
    Output(key, value)

Algorithm 2 MapReduce algorithm for NRE index frequency
Input: Navigation axis index of NRE for RDF graph
Output: Statistical frequency of each navigation axis of NRE
Map([key], [value]):
    Read input (key, valuelist)
    For each axis in valuelist
        key=axis value=1
        Output([key], [value])
Reduce(key, [value]):
    Read input (key, [value])
    sum=0
    For each value in [value]
        sum=sum+value
    key=key value=sum
    Output(key, value)
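The map and reduce steps of Algorithms 1 and 2 can be sketched on a single machine as follows. This is only an illustrative sketch, not the authors' distributed implementation: the function names (`map_triple`, `build_index`, `axis_frequency`) and the sample triples are hypothetical, the inverse axes are written with a `^-1` suffix, and the first emitted value is assumed to be `(next::p, o)`, in line with the definition of next::p in Definition 6.

```python
from collections import Counter, defaultdict


def map_triple(s, p, o):
    """Map step of Algorithm 1: emit (key, (axis, target)) pairs for one triple."""
    return [
        (s, (f"next::{p}", o)),
        (s, (f"edge::{o}", p)),
        (p, (f"edge^-1::{o}", s)),
        (p, (f"node::{s}", o)),
        (o, (f"next^-1::{p}", s)),
        (o, (f"node^-1::{s}", p)),
    ]


def build_index(triples):
    """Reduce step of Algorithm 1: group values by key, then append the self axis."""
    index = defaultdict(list)
    for s, p, o in triples:
        for key, value in map_triple(s, p, o):
            index[key].append(value)
    for key in list(index):
        index[key].append((f"self::{key}", key))
    return dict(index)


def axis_frequency(index):
    """Algorithm 2: count how often each navigation axis occurs in the index."""
    counts = Counter()
    for values in index.values():
        for axis, _target in values:
            counts[axis] += 1
    return counts


# Hypothetical sample data, not taken from DrugBank or BioGRID.
triples = [("r1", "knows", "r2"), ("r1", "knows", "r3"), ("r2", "likes", "r3")]
index = build_index(triples)
freq = axis_frequency(index)
```

In this toy graph `next::knows` occurs twice and `next::likes` once, so `next::likes` would be the better cut point if both appeared in a query.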

2 Evaluation algorithm

The idea of the algorithm based on rare axis is as follows: (1) select a navigation axis in the NRE as the rare axis according to the statistical results in Section 1 and the definitions in this section; (2) cut the NRE expression at the rare axis into two parts, and apply a reverse transformation to the first part; (3) evaluate the two parts with the automata algorithm, starting from the endpoints of the rare axis matches; (4) merge the evaluation results of the two parts that share the same endpoints.
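The four steps above can be illustrated on a deliberately restricted case. The sketch below assumes that the query is a plain concatenation of `axis::a` steps (no nesting, alternation or closure), that axis frequency is obtained by counting matches directly rather than from a precomputed index, and that inverse axes are written with a `^-1` suffix; all function names and the sample graph are hypothetical.

```python
# How each axis reads a triple (s, p, o) as (source, label, target).
VIEWS = {
    "next":    lambda s, p, o: (s, p, o),
    "next^-1": lambda s, p, o: (o, p, s),
    "edge":    lambda s, p, o: (s, o, p),
    "edge^-1": lambda s, p, o: (p, o, s),
    "node":    lambda s, p, o: (p, s, o),
    "node^-1": lambda s, p, o: (o, s, p),
}
INVERSE = {"next": "next^-1", "edge": "edge^-1", "node": "node^-1"}
INVERSE.update({inv: axis for axis, inv in list(INVERSE.items())})


def matches(triples, axis, a):
    """All pairs (v, w) such that w is reached from v by one axis::a step."""
    view = VIEWS[axis]
    return {(src, tgt) for src, label, tgt in (view(*t) for t in triples) if label == a}


def eval_path(triples, path, start):
    """Nodes reached from start by following every step of path in order."""
    frontier = {start}
    for axis, a in path:
        frontier = {tgt for src, tgt in matches(triples, axis, a) if src in frontier}
    return frontier


def reverse_path(path):
    """Reverse a concatenation: reverse the step order and invert each axis."""
    return [(INVERSE[axis], a) for axis, a in reversed(path)]


def eval_naive(triples, path):
    """Baseline: start one traversal from every term of the graph."""
    terms = {x for t in triples for x in t}
    return {(v, w) for v in terms for w in eval_path(triples, path, v)}


def eval_by_rare_axis(triples, path):
    """Cut at the least frequent step, then search backward and forward from its matches."""
    k = min(range(len(path)), key=lambda i: len(matches(triples, *path[i])))
    prefix, (axis, a), suffix = path[:k], path[k], path[k + 1:]
    result = set()
    for v, w in matches(triples, axis, a):
        starts = eval_path(triples, reverse_path(prefix), v)
        ends = eval_path(triples, suffix, w)
        result |= {(x, y) for x in starts for y in ends}
    return result


triples = [("a", "p", "b"), ("b", "p", "c"), ("c", "q", "d"), ("d", "p", "e"), ("x", "p", "y")]
path = [("next", "p"), ("next", "q"), ("next", "p")]
```

Here `next::q` matches only one pair, so the search is anchored at that pair instead of being started from every term of the graph.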

Transactions of Tianjin University

Vol.21 No.3 2015

2.1 Automata algorithm for NRE

The NRE automata algorithm comes from some temporal and dynamic propositional logic algorithms, as well as some automata theories[16]. The semantics of NRE is defined in the same manner as the existential semantics, i.e., it returns the set of pairs of nodes connected by the NRE on the RDF graph. The semantics of exp on G is denoted as $[\![exp]\!]_G$.

Definition 2 Definitions related to RDF graph
For an infinite set U of URIs, an RDF triple is t = (s, p, o) ∈ U × U × U, where s is the subject, p is the predicate, and o is the object. An RDF graph G is a finite set of triples t; terms(G) represents the set of all elements appearing in G, and |G| represents the size of G, i.e., the number of triples in G.

Definition 3 Nondeterministic finite automaton (NFA)
An NFA A is a quintuple (S, Σ, δ, I, F). It consists of a finite set of states S, a finite set of input symbols Σ (Σ ∩ S = ∅), a transition function $\delta: S \times \Sigma \to 2^S$, an initial state set I ⊆ S and a termination state set F ⊆ S. An ε-NFA is an NFA with ε transitions, and its transition function is $\delta: S \times (\Sigma \cup \{\varepsilon\}) \to 2^S$.

Definition 4 Depth of NRE
Let exp be an NRE; then the i-depth of exp, $D_i(exp)$ (i ≥ 0), is defined as follows:

$$
D_0(exp) = \begin{cases}
D_0(exp_1) \cup D_0(exp_2), & \text{if } exp = exp_1/exp_2 \text{ or } exp = exp_1 | exp_2 \\
D_0(exp_1), & \text{if } exp = exp_1^{*} \text{ or } exp = exp_1^{+} \\
\{exp\}, & \text{if } exp = axis, \text{ or } axis::a, \text{ or } axis::[exp']
\end{cases}
$$

$$
D_i(exp) = \bigcup_{axis::[exp'] \in D_{i-1}(exp)} D_0(exp') \quad (i \ge 1)
$$

where |exp| is the size of exp, $|exp| = |D_0(exp)|$; $D_{|exp|}$ is the depth of exp, $D_{|exp|} = d$ iff $D_{d+1}(exp) = \emptyset$.

Definition 5 NFA description for NRE
Let exp be an NRE expression; its corresponding ε-NFA is $A_{exp} = (Q, D_0(exp), \delta_A, I, F)$, which recognizes the regular expressions constructed over $D_0(exp)$. For example, Fig. 3 is the corresponding ε-NFA of the NRE expression β. Here $|A_{exp}|$ is the size of $A_{exp}$,

$$|A_{exp}| = \sum_{(q,\,l,\,q') \in \delta_A} |l|$$

and |l| is the length of a transition edge.

Fig. 3 Corresponding ε-NFA of β

Definition 6 Navigation axis index for RDF graph
Let G be an RDF graph; its corresponding navigation axis index of NRE is $A_G = (terms(G), \delta_G)$, where

$$
\delta_G(v, e) = \begin{cases}
\{v \mid v \in terms(G)\}, & \text{if } e = self::v \\
\{a \mid (v, p, a) \in G\}, & \text{if } e = next::p \\
\{b \mid (b, p, v) \in G\}, & \text{if } e = next^{-1}::p \\
\{c \mid (v, c, p) \in G\}, & \text{if } e = edge::p \\
\{d \mid (d, v, p) \in G\}, & \text{if } e = edge^{-1}::p \\
\{h \mid (p, v, h) \in G\}, & \text{if } e = node::p \\
\{f \mid (p, f, v) \in G\}, & \text{if } e = node^{-1}::p
\end{cases}
$$

Here $|A_G|$ is the size of the navigation axis index of G,

$$|A_G| = \sum_{(v,\,l',\,v') \in \delta_G} |l'|$$

and |l'| is the length of a navigation axis.
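To make Definition 4 concrete, the following sketch computes $D_0$, $D_i$ and the depth of an NRE. It assumes a hypothetical encoding of expressions as nested tuples: `("axis", name)`, `("label", name, a)` for axis::a, `("nest", name, exp)` for axis::[exp], and `("seq", e1, e2)`, `("alt", e1, e2)`, `("star", e)`, `("plus", e)` for the composite forms. The encoding and the function names are illustrative assumptions, not part of the original method.

```python
def d0(exp):
    """D_0(exp): the set of top-level steps of exp (Definition 4)."""
    kind = exp[0]
    if kind in ("seq", "alt"):
        return d0(exp[1]) | d0(exp[2])
    if kind in ("star", "plus"):
        return d0(exp[1])
    return {exp}  # axis, axis::a, or axis::[exp']


def d(exp, i):
    """D_i(exp): the steps found inside the nested tests of D_{i-1}(exp)."""
    if i == 0:
        return d0(exp)
    steps = set()
    for e in d(exp, i - 1):
        if e[0] == "nest":
            steps |= d0(e[2])
    return steps


def depth(exp):
    """Smallest d such that D_{d+1}(exp) is empty."""
    level = 0
    while d(exp, level + 1):
        level += 1
    return level


# beta = node::a/(next::[next::b])*/next::c/(edge::[next::d])+
beta = ("seq", ("label", "node", "a"),
        ("seq", ("star", ("nest", "next", ("label", "next", "b"))),
         ("seq", ("label", "next", "c"),
          ("plus", ("nest", "edge", ("label", "next", "d"))))))
```

For the expression β used as the running example in this section, `d0(beta)` contains the four steps node::a, next::[next::b], next::c and edge::[next::d], and the depth is 1 because the nested tests contain no further nesting.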

Definition 7 Product automaton
Let $A_G = (terms(G), \delta_G)$ be the navigation axis index for G and $A_{exp} = (Q, D_0(exp), \delta_A, I, F)$ the corresponding ε-NFA of the NRE exp. The product automaton is then $P = (terms(G) \times Q, D_0(exp), \delta_P, I \times terms(G), F \times terms(G))$. With axis ∈ {self, next, next⁻¹, edge, edge⁻¹, node, node⁻¹}, v ∈ terms(G) and q ∈ Q, $\delta_P$ is defined as follows:

$$\delta_P((v, q), axis) = \bigcup_{a \in terms(G)} \delta_G(v, axis::a) \times \delta_A(q, axis)$$

$$\delta_P((v, q), axis::a) = \delta_G(v, axis::a) \times \delta_A(q, axis::a)$$

$$\delta_P((v, q), axis::[exp]) = \bigcup_{a \in terms(G),\ exp \in label(a)} \delta_G(v, axis::a) \times \delta_A(q, axis::[exp])$$

$$\delta_P((v, q), \varepsilon) = \{v\} \times \delta_A(q, \varepsilon)$$

For example, for β = node::a/(next::[next::b])*/next::c/(edge::[next::d])+, $D_0(\beta)$ = {node::a, next::[next::b], next::c, edge::[next::d]}. Its corresponding $A_\beta$ is shown in Fig. 3. Fig. 4 is an RDF graph G'. According to Definition 7, the product automaton $P' = A_{G'} \times A_\beta$ is constructed from β and G', as shown in Fig. 5, where only reachable states are shown for the sake of space. From Fig. 5, we can see that β ∈ label(b), because there exists (r1, r7) ∈ $[\![\beta]\!]_G$. Since P has at most $|A_G| \times |A_{exp}|$ states, the construction time of $A_G \times A_{exp}$ is $O(|A_G| \cdot |A_{exp}|)$. In the following, we give the reachability evaluation algorithm of NRE with a given starting point, which is divided into two parts.

Algorithm 3 Reachability algorithm of NRE
Input: RDF graph G, NRE expression exp and starting point a
Output: Evaluation result set
LABEL(G, exp)
    For each axis::[exp'] ∈ D0(exp) do
        Call LABEL(G, exp')
    Construct Aexp and assume that q0 is its initial state and F is its set of final states
    Construct AG × Aexp
    For each state (u, q0) that is connected to (v, qf) in AG × Aexp
        LABEL(u) = LABEL(u) ∪ {exp}
EVAL(G, exp, a)
    For each u ∈ terms(G) do
        LABEL(u) = ∅
    Call LABEL(G, exp)
    For each state (b, qf), which is reachable from (a, q0) in AG × Aexp, with qf ∈ F
        Result = Result ∪ {(a, b)}

Fig. 4 RDF graph G′

Fig. 5 Product automaton P′
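The reachability part of EVAL can be sketched as a breadth-first search over product states (v, q). The sketch below is a simplification under stated assumptions: the automaton is written by hand as an ε-free NFA whose transitions are labelled with `axis::a` steps only (no nested tests, so the LABEL phase is omitted), product states are generated lazily instead of constructing $A_G \times A_{exp}$ up front, and the names `delta_g` and `eval_from` as well as the sample data are hypothetical.

```python
from collections import deque

# How each axis reads a triple (s, p, o) as (source, label, target), per Definition 6.
VIEWS = {
    "next":    lambda s, p, o: (s, p, o),
    "next^-1": lambda s, p, o: (o, p, s),
    "edge":    lambda s, p, o: (s, o, p),
    "edge^-1": lambda s, p, o: (p, o, s),
    "node":    lambda s, p, o: (p, s, o),
    "node^-1": lambda s, p, o: (o, s, p),
}


def delta_g(triples, v, axis, a):
    """Targets reached from v by one axis::a step."""
    view = VIEWS[axis]
    return {tgt for src, label, tgt in (view(*t) for t in triples)
            if src == v and label == a}


def eval_from(triples, nfa, start):
    """Return all pairs (start, b) such that some (b, qf) with qf final is reachable."""
    transitions, q0, finals = nfa
    seen = {(start, q0)}
    queue = deque(seen)
    result = set()
    while queue:
        v, q = queue.popleft()
        if q in finals:
            result.add((start, v))
        for (axis, a), q_next in transitions.get(q, []):
            for w in delta_g(triples, v, axis, a):
                if (w, q_next) not in seen:  # visit each product state once
                    seen.add((w, q_next))
                    queue.append((w, q_next))
    return result


# Hand-built NFA for (next::p)*/next::q : state 0 loops on next::p, next::q leads to final state 1.
nfa = ({0: [(("next", "p"), 0), (("next", "q"), 1)]}, 0, {1})
triples = [("r1", "p", "r2"), ("r2", "p", "r3"), ("r3", "q", "r4"), ("r1", "q", "r5")]
pairs = eval_from(triples, nfa, "r1")
```

Because every product state is enqueued at most once, the traversal is bounded by the number of product states and their outgoing transitions, which is the quantity counted in the complexity argument that follows.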

Theorem 1 The time complexity of EVAL(G, exp, a) is $O(|exp| \cdot |G|)$.

Proof Given an element a ∈ terms(G), EVAL returns all the elements b ∈ terms(G) such that (a, b) ∈ $[\![exp]\!]_G$. In the worst case, the time complexity of the depth-first traversal is

$$
O\Big(\sum_{\substack{(q,\,l,\,q') \in \delta_A \\ (v,\,l',\,v') \in \delta_G}} |l| \cdot |l'|\Big)
= O\Big(\sum_{(q,\,l,\,q') \in \delta_A}\ \sum_{(v,\,l',\,v') \in \delta_G} |l| \cdot |l'|\Big)
= O\Big(\sum_{(q,\,l,\,q') \in \delta_A} |l| \cdot \sum_{(v,\,l',\,v') \in \delta_G} |l'|\Big)
= O(|A_G| \cdot |A_{exp}|)
$$

Considering $O(|A_G|) = O(|G|)$ and $O(|A_{exp}|) = O(|exp|)$, the time complexity of EVAL(G, exp, a) is $O(|exp| \cdot |G|)$.

Without the starting point, the EVAL algorithm needs to make $O(|G|)$ depth-first traversals, which results in the following Theorem 2.

Theorem 2 The time complexity of EVAL(G, exp) is $O(|exp| \cdot |G|^2)$.

From Theorem 2, we can see that the current evaluation complexity of NRE is proportional to the square of the size of the RDF graph G. With the explosive growth of RDF graph data, even polynomial time complexity cannot meet the query requirements of NRE. Therefore, we propose a new evaluation algorithm for NRE based on rare axis, which can reduce the evaluation complexity from polynomial time to nearly linear time.

2.2 Evaluation algorithm for NRE based on rare axis

Algorithm 3 does not consider the frequency of navigation axes. According to the navigation axis frequencies calculated by Algorithm 2, different navigation axes appear in the NRE index for an RDF graph with different frequencies. When a navigation axis with low frequency appears in an NRE expression, the NRE can be cut at this navigation axis to reduce the search scope of the RDF graph. Accordingly, we propose a new evaluation algorithm for NRE based on rare axis.

Definition 8 Rare axis
Let G be an RDF graph and exp an NRE expression; then a navigation axis r is a rare axis iff $r \in D_0(exp)$, D(r) = 0 and P(r) ≤ m, where m ∈ N, m ≥ 1, and P(r) is the frequency with which navigation axis r appears in $A_G$.

As rare axis is a relative concept, m can theoretically be any positive integer used as the threshold of rare axis. To accommodate the growing web-scale RDF graph data and keep the value of m relatively stable, we set m = lg|G|. When there exists more than one rare axis in an NRE, the rare axis with the minimum value of P(r) is taken as the cut-off point and the NRE expression is cut into two parts; a reverse transformation is then applied to the first part. The reverse transformation function is defined as follows.

Definition 9 Reverse transformation function

$$
trans^{-1}(exp) = \begin{cases}
trans^{-1}(exp_2)/trans^{-1}(exp_1), & \text{if } exp = exp_1/exp_2 \\
trans^{-1}(exp_1) \,|\, trans^{-1}(exp_2), & \text{if } exp = exp_1 | exp_2 \\
[trans^{-1}(exp_1)]^{*}, & \text{if } exp = exp_1^{*} \\
trans^{-1}(exp_1)::[exp_2], & \text{if } exp = exp_1::[exp_2] \\
axis^{-1}::a \mid axis^{-1}, & \text{if } exp = axis::a \mid axis,\ axis \in \{next, edge, node\} \\
axis::a \mid axis, & \text{if } exp = axis^{-1}::a \mid axis^{-1},\ axis^{-1} \in \{next^{-1}, edge^{-1}, node^{-1}\} \\
exp, & \text{if } exp = self \mid self::a
\end{cases}
$$

where a ∈ terms(G).

Theorem 3 For any NRE expression exp, there exists a reverse expression $trans^{-1}(exp)$.
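The reverse transformation of Definition 9 can be written as a short recursive function. The sketch assumes a hypothetical tuple encoding of NRE expressions: `("axis", name)`, `("label", name, a)` for axis::a, `("nest", name, exp)` for axis::[exp], and `("seq", e1, e2)`, `("alt", e1, e2)`, `("star", e)`, `("plus", e)` for composite forms. Inverse axes carry a `^-1` suffix, and the `plus` case is an assumption that treats exp+ like exp*, since Definition 9 lists only the star case.

```python
INVERSE = {"next": "next^-1", "edge": "edge^-1", "node": "node^-1"}
INVERSE.update({inv: axis for axis, inv in list(INVERSE.items())})


def reverse(exp):
    """trans^-1(exp): an expression that walks the same paths in the opposite direction."""
    kind = exp[0]
    if kind == "seq":
        # exp1/exp2 becomes trans^-1(exp2)/trans^-1(exp1)
        return ("seq", reverse(exp[2]), reverse(exp[1]))
    if kind == "alt":
        return ("alt", reverse(exp[1]), reverse(exp[2]))
    if kind in ("star", "plus"):
        return (kind, reverse(exp[1]))
    # Atomic steps: invert the axis, keep the label or nested test unchanged.
    name = exp[1]
    inverted = name if name == "self" else INVERSE[name]
    return (kind, inverted) + exp[2:]


# next::a/edge^-1::b reverses to edge::b/next^-1::a
query = ("seq", ("label", "next", "a"), ("label", "edge^-1", "b"))
reversed_query = reverse(query)
```

Applying `reverse` twice returns the original expression, which reflects the symmetry of the navigation axes that the algorithm relies on.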


According to Definition 9 and the properties of NRE navigation axes, Theorem 3 is obviously true, because for any navigation axis axis ∈ {next, edge, node} there exists a reverse navigation axis axis⁻¹ ∈ {next⁻¹, edge⁻¹, node⁻¹}, and self is reflexive. The evaluation algorithm for NRE based on rare axis can therefore exploit the symmetry of NRE navigation axes to improve query efficiency. The new algorithm is defined as follows, where GEVAL(G, r) represents the matching results of rare axis r in the NRE navigation axis index $A_G$ for RDF graph G, and its complexity is $O(|G|)$.

Algorithm 4 Evaluation algorithm for NRE based on rare axis
Input: RDF graph G and NRE expression exp
Output: Evaluation results of exp in G
REVAL(G, exp)
    1. According to Definition 8, obtain the rare axis set R from NRE exp; if R = ∅, go to step 2, otherwise go to step 3
    2. Return EVAL(G, exp)
    3. Get M = GEVAL(G, r0), where r0 ∈ R and P(r0) = min{P(r) | r ∈ R}
       Split NRE exp from r0 to expA and expB, and calculate trans⁻¹(expA)
       For each (a, b) ∈ M
           A = EVAL(G, trans⁻¹(expA), a)
           B = EVAL(G, expB, b)
           Result = Result + A × B
       Return Result

Theorem 4 The time complexity of REVAL(G, exp, R) is $O(|exp| \cdot |G| \cdot |m|)$.

Because m = lg|G|, the above theorem shows that when there exists a rare axis in the NRE expression, the evaluation algorithm for NRE based on rare axis can reduce the time complexity from $O(|exp| \cdot |G|^2)$ to $O(|exp| \cdot |G| \cdot \lg|G|)$. Since the complexity of this algorithm is almost linear, it can better meet the path query requirements on web-scale RDF graph data.

3 Experimental results and analysis

The experimental data sets are from DrugBank[17] and BioGRID[18]. The selected queries in the experiments are NRE expressions with rare axes. The experiments were conducted in a distributed environment, and each node was configured with an Intel Q8400 CPU, 4 GB RAM, a 500 GB/7200 rpm hard disk, Ubuntu 10.04 and JDK 1.6.

Experiment 1 Efficiency of the algorithm for NRE based on rare axis

Tab. 1 and Fig. 6 show that the REVAL algorithm can improve the query efficiency of NRE significantly while producing the same evaluation results. The REVAL algorithm not only reduces the search scope of the RDF graph, but also shortens the length of the NRE query expressions to be evaluated. The data sizes of DrugBank and BioGRID used in the experiment were 80 MB and 100 MB, respectively. Besides, Fig. 7 shows that with the growth of RDF graph data, the time consumed by REVAL grows nearly linearly, while the time consumed by the current EVAL grows far faster. This result is consistent with Theorem 4.

Tab. 1 Experimental comparison between EVAL and REVAL on DrugBank

NRE   Rare axis   Result   EVAL/s   REVAL/s
Q1    Refludan    17       3.8      58
Q2    Erbitux     23       4.5      62
Q3    Pulmozyme   12       2.1      70
Q4    Viscozyme   27       2.5      77
Q5    Enbrel      6        1.6      55
Q6    Angiomax    35       3.2      89
Q7    Eligard     9        2.0      72
Q8    Leuplin     13       2.3      81

Fig. 6 Experimental comparison between EVAL and REVAL on BioGRID

Fig. 7 Time consumed by EVAL and REVAL on DrugBank for NRE query with rare axis

Experiment 2 Efficiency of constructing NRE distributed index for RDF graph

Fig. 8 is the experimental result of constructing the NRE distributed index by the MapReduce algorithm on a real RDF data set from BioGRID. Because each RDF triple in G corresponds to a fixed and finite number of NRE navigation axes, the size of the NRE index is linear in the size of the RDF graph.
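The linearity claim can be checked with a short count of index entries. The derivation below is a sketch that assumes the emission scheme of Algorithm 1 (six values per triple in the map step, plus one self entry per distinct key in the reduce step) and counts entries rather than label lengths; the symbols $N_{\mathrm{map}}$, $N_{\mathrm{self}}$ and $N_{\mathrm{index}}$ are introduced here only for illustration.

```latex
\begin{align*}
N_{\mathrm{map}}   &= 6\,|G|
  && \text{six (axis, target) values emitted per triple} \\
N_{\mathrm{self}}  &= |\mathit{terms}(G)| \le 3\,|G|
  && \text{one self entry per key; each triple contributes at most three terms} \\
N_{\mathrm{index}} &= N_{\mathrm{map}} + N_{\mathrm{self}} \le 9\,|G| = O(|G|)
\end{align*}
```

Under these assumptions the index holds at most nine entries per triple, so its size grows linearly with |G|.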

Fig. 8 Performance of constructing NRE index using MapReduce on different number of nodes

Fig. 8 shows that by adding nodes, the distributed technology does improve the efficiency of constructing the NRE index and keeps the construction time at a weak linear growth.

4 Conclusions

According to the characteristics of RDF graph data and NRE navigation axes, a new concept of rare axis is proposed and a new evaluation algorithm for NRE based on rare axis is designed. For NRE containing rare axes, the proposed algorithm can reduce the evaluation complexity from polynomial time to nearly linear time. The experiments further verify the performance of the algorithm.

References

[1] Berners-Lee T, Hendler J, Lassila O. The semantic web [J]. Scientific American, 2001, 284(5): 34-43.
[2] Bizer C, Heath T, Berners-Lee T. Linked data - The story so far [J]. International Journal on Semantic Web and Information Systems, 2009, 5(3): 1-22.
[3] Horrocks I, Patel-Schneider P F, van Harmelen F. From SHIQ and RDF to OWL: The making of a web ontology language [J]. Web Semantics: Science, Services and Agents on the World Wide Web, 2003, 1(1): 7-26.
[4] Jiang Y, Feng Z Y, Wang X. A multikeyrank model based on ontology for large-scale semantic data [J]. Chinese Journal of Electronics, 2014, 23(1): 119-123.
[5] Koschmieder A, Leser U. Regular path queries on large graphs [C]. In: Proceedings of the 24th International Conference on Scientific and Statistical Database Management. Crete, Greece, 2012.
[6] Pérez J, Arenas M, Gutierrez C. nSPARQL: A navigational language for RDF [J]. Web Semantics: Science, Services and Agents on the World Wide Web, 2010, 8(4): 255-270.
[7] Arenas M, Conca S, Pérez J. Counting beyond a Yottabyte, or how SPARQL 1.1 property paths will prevent adoption of the standard [C]. In: Proceedings of the 21st International Conference on World Wide Web. Lyon, France, 2012.
[8] Jiang Y, Feng Z Y, Wang X et al. Adapting property path for polynomial-time evaluation and reasoning on semantic web [J]. Transactions of Tianjin University, 2013, 19(2): 130-139.
[9] Alkhateeb F, Baget J F, Euzenat J et al. Constrained regular expressions in SPARQL [C]. In: International Conference on Semantic Web and Web Services. Las Vegas, USA, 2008.
[10] Gelade W, Gyssens M, Martens W. Regular expressions with counting: Weak versus strong determinism [J]. SIAM Journal on Computing, 2012, 41(1): 160-190.
[11] Alkhateeb F, Baget J-F, Euzenat J. Extending SPARQL with regular expression patterns (for querying RDF) [J]. Web Semantics: Science, Services and Agents on the World Wide Web, 2009, 7(2): 57-73.
[12] Barceló P, Pérez J, Reutter J L. Relative expressiveness of nested regular expressions [C]. In: Proceedings of the 6th Alberto Mendelzon International Workshop on the Foundations of Data Management. Ouro Preto, Brazil, 2012.
[13] Furche T, Weinzierl A, Bry F. Labeling RDF graphs for linear time and space querying. In: Semantic Web Information Management [M]. Springer, Berlin, Germany, 2010.
[14] Przyjaciel-Zablocki M, Schätzle A, Hornung T et al. RDFPath: Path query processing on large RDF graphs with MapReduce [C]. In: Proceedings of the 8th Extended Semantic Web Conference. Heraklion, Greece, 2011.
[15] Dean J, Ghemawat S. MapReduce: Simplified data processing on large clusters [J]. Communications of the ACM, 2008, 51(1): 107-113.
[16] Zauner H, Linse B, Furche T et al. A RPL through RDF: Expressive navigation in RDF graphs [C]. In: Proceedings of the Fourth International Conference on Web Reasoning and Rule Systems. Brixen, Italy, 2010.
[17] Open Data Drug & Drug Target Database [EB/OL]. http://www.drugbank.ca/, 2013.08.
[18] Biological General Repository for Interaction Datasets [EB/OL]. http://thebiogrid.org/, 2013.07.

(Editor: Wu Liyou)
