A Probabilistic Approach to Navigation in Hypertext - Department of ...

9 downloads 112 Views 263KB Size Report
trail that satisfies a HQL query (technically known as the model checking ... can be read nonsequentially, in contrast to traditional text, for example in book form, ...
A Probabilistic Approach to Navigation in Hypertext Mark Levene University College London Gower Street London WC1E 6BT, U.K. email: [email protected]

George Loizou Birkbeck College Malet Street London WC1E 7HX, U.K. email: [email protected]

Abstract

One of the main unsolved problems confronting Hypertext is the navigation problem, namely the problem of having to know where you are in the database graph representing the structure of a Hypertext database, and knowing how to get to some other place you are searching for in the database graph. Previously we formalised a Hypertext database in terms of a directed graph whose nodes represent pages of information. The notion of a trail, which is a path in the database graph describing some logical association amongst the pages in the trail, is central to our model. We dened a Hypertext Query Language, HQL, over Hypertext databases and showed that in general the navigation problem, i.e. the problem of nding a trail that satises a HQL query (technically known as the model checking problem), is NPcomplete. Herein we present a preliminary investigation of using a probabilistic approach in order to enhance the eciency of model checking. The avour of our investigation is that if we have some additional statistical information about the Hypertext database then we can utilise such information during query processing. We present two di erent approaches. The rst approach utilises the theory of probabilistic automata. In this approach we view a Hypertext database as a probabilistic automaton, which we call a Hypertext probabilistic automaton. In such an automaton we assume that the probability of traversing a link is determined by the usage statistics of that link. We exhibit a special case when the number of trails that satisfy a query is always nite and indicate how to give a nite approximation of answering a query in the general case. The second approach utilises the theory of random Turing machines. In this approach we view a Hypertext database as a probabilistic algorithm, realised via a Hypertext random automaton. In such an automaton we assume that out of a choice of links, traversing any one of them is equally likely. We obtain a lower bound of the probability that a random trail satises a query. In principle, by iterating this probabilistic algorithm, associated with the Hypertext database, the probability of nding a trail that satises the query can be made arbitrarily large.

Keywords. Hypertext, navigation, trail, probabilistic automata, query language

1 Introduction Hypertext NIEL90], or more generally Hypermedia, is text (which may contain multimedia) that can be read nonsequentially, in contrast to traditional text, for example in book form, which has a single linear sequence dening the order in which the text is to be scanned. Hypertext presents several di erent options to readers, and the individual reader chooses a particular sequence at the time of reading. The inspiration for Hypertext originates from Bush's visionary idea of the memex machine BUSH45]. A Hypertext database LEVE94] (see also STOT89, FRIS92]) is formalised as a directed graph BUCK90] (called the reachability relation) whose nodes (called states) represent textual units of information (called pages) and whose arcs (called links or transitions) allow the reader to navigate from an anchor node to a destination node. The structure of a Hypertext database changes over 1

time, and thus di ers from traditional databases, such as relational databases ULLM88], which have a regular structure dened by a relational database schema. In order to query pages we also associate with a Hypertext database an information repository, which maps each state of the reachability relation to a set of conditions that are true at that state. In logical terminology a Hypertext database is a temporal structure EMER90] (see also STOT92] whereby temporal logic is used to formalise the notion of a Hypertext database). The process of traversing links and following a trail of information in a Hypertext database is called navigation (or alternatively link following). In graph-theoretic terms a trail in a Hypertext database is a path in the database graph (we allow a node to occur more than once in a path paths are called walks in BUCK90]). Navigating through a Hypertext database leads to the problem of getting \lost in hyperspace" NIEL90], which is the problem of having to know where you are in the database graph and knowing how to get to some other place that you are searching for in the database graph. From now on we will refer to this fundamental problem as the navigation problem. In a more general context of a Hypertext system the process of nding and examining the text pages associated with destination nodes is called browsing NIEL90]. The browser is the component of a Hypertext system that helps users search for the information they are interested in by graphically displaying the relevant parts of the database and providing contextual and spatial cues with the use of navigational aids such as maps and guides FRIS92]. A set of tools that aid navigation by performing a structural analysis of the database graph is described in RIVL94]. In a previous paper LEVE94] we studied the navigation problem in Hypertext via a Hypertext Query Language, HQL, which is a subset of propositional linear temporal logic EMER90]. The navigation problem can be viewed as the process of nding a trail which satises the user's query in logical terminology this is known as the model checking problem. In LEVE94] we have shown that the navigation problem in Hypertext cannot be solved eciently, since model checking in HQL is, in general, NP-complete GARE79]. Despite this negative result we have shown therein that in some special cases model checking can be done in polynomial time. Finally, we concluded that in practice, if we do not wish to restrict the user's query language, approximate solutions to the model checking problem must be sought after. The aim of this paper is to present a preliminary investigation of using a probabilistic approach in order to enhance the eciency of model checking. We present two di erent approaches and give some preliminary results for each approach. The avour of our investigation is that if we have some additional statistical information about the Hypertext database then we can utilise such information during query processing. The rst approach utilises the theory of probabilistic automata RABI63, PAZ71]. In this approach we view a Hypertext database as a probabilistic automaton, called a Hypertext Probabilistic Automaton (HPA). The arcs of the reachability relation of the HPA are labelled with probabilities that are computed from statistical information related to the frequency that users choose to navigate from a state si to an adjacent state sj via the arc (si  sj ), when browsing the text associated with si . In this case we investigate the language accepted by such a HPA with a cut-point of  in the open interval (0,1) (i.e. a trail, viewed as a string, is accepted by the HPA if the probability of the trail is greater than ). We show that this language is both a star-free MCNA71] and an unambiguous STEA85] regular language. In addition, if the reachability relation is 2total, i.e. there are always at least two choices to navigate from any given state si, then this language is also nite. The second approach utilises the theory of random Turing machines LEEU56, RABI76, GILL77, JOHN90]. In this approach we view a Hypertext database as a probabilistic algorithm, realised via a Hypertext Random Automaton (HRA), in which random choices are made during navigation assuming that all states adjacent to a state si are equally likely, i.e. they are uniformly distributed. In this case we investigate the probability that a random trail consisting of k states, 2

constructed by k random choices while navigating through the Hypertext database, is a trail that satises a given query. In principle, by iterating this probabilistic algorithm m times, where m is an appropriate natural number, the probability of nding a trail that satises the query can be made arbitrarily large. For the special case when the reachability relation is -regular, i.e. the number of choices to navigate from a state si is the constant , we give a lower bound on the number of trails in the reachability relation that satisfy a given query, assuming that there exists at least one trail that logically implies the query. This lower bound is used to obtain a lower bound on the probability that a random trail satises the query. In both approaches a Hypertext database can be viewed as a nite homogeneous Markov chain (or simply a Markov chain) KEME60, FELL68]. In the rst approach the initial and transition probabilities are determined by statistical information and in the second approach these probabilities are assumed to be uniform. The following notation will be used throughout the paper.

Denition 1.1 (Notation) The set of all natural numbers is denoted by !.

The set of all strings of nite length, over a nite nonempty alphabet , is denoted by  and

+ =  ; feg, where e denotes the empty string. A language L over  is a subset of +.

Concatenation of two strings w and z will be denoted by wz and xk will denote the concatenation of x with itself k times with k  0. In addition, a string y is a substring of a string w if there exist strings x and z such that w = xyz  if the symbol a 2  is a substring of w we write a 2 w. We denote the cardinality of a set, S, by jSj and the size of a string, i.e. the number of symbols in w, by jjwjj in particular jjejj = 0. The set of all strings in  having size k 2 ! is denoted by k . The nite powerset of S is denoted by PowerSet(S). The open interval from 0 to 1 is denoted by (0,1), the closed interval from 0 to 1 is denoted by 0,1] and the half-open interval from 0 to 1, which is open to the right, is denoted by 0,1). Finally, we abbreviate \if and only if" to i . The rest of the paper is organised as follows. In Section 2 we introduce our underlying model for Hypertext that was presented fully in LEVE94]. In Section 3 we introduce the notion of Hypertext nite automata and give a characterisation of the subclass of regular languages accepted by such automata. In Section 4 we dene Hypertext probabilistic automata and characterise the subclass of regular languages accepted by such automata. In Section 5 we dene a probabilistic query language for Hypertext and investigate the complexity of model checking in this query language. In Section 6 we dene Hypertext random automata based on the concept of random Turing machines and introduce the notion of the success probability of a Hypertext random automaton. In Section 7 we investigate query answering with respect to Hypertext random automata by giving a lower bound on the success probability of a Hypertext random automaton. Finally, in Section 8 we give our concluding remarks.

2 Hypertext Databases and Trail Queries Herein we give the necessary background from LEVE94] concerning Hypertext Databases and trail queries. We use the following graph-theoretic terminology where R is a binary relation in a set S (R corresponds to a directed graph in which loops are allowed BUCK90]). Whenever (si  sj ) 2 R we say that sj is adjacent to si  we call (si  sj ) an arc. The degree of an element si 2 S is the number of states sj 2 S that are adjacent to si . The following special types of binary relations will be used throughout the paper. 3

Denition 2.1 (A complete relation) A relation R in S is complete (cf. complete graph BUCK90]) if 8s1  s2 2 S, (s1  s2 ) 2 R. Denition 2.2 (A regular relation) A relation R in S is -regular (or simply regular) if 9 2 f1,: : :,jSjg such that 8si 2 S, the degree of si is  (cf. regular graph BUCK90]). Denition 2.3 (A total relation) A relation R in S is total if 8s 2 S, 9t 2 S such that (s t) 2 R.

Denition 2.4 (A 2total relation) A relation R in S is 2total if 8s 2 S, 9t1 t2 2 S such that t1 = 6 t2 and (s t1) (s t2 ) 2 R. We will now dene a Hypertext database and a simplied version of HQL see LEVE94] for the full version of HQL.

Denition 2.5 (Hypertext database) Let  be a countably innite set of primitive propositions (also called conditions) and S = fs1  : : :  sN g be a nite set of states.

A Hypertext database H is an ordered pair, , where I is a mapping from S to PowerSet() and R is a total relation in S. I is called the information repository of H and R is called the reachability relation of H. Intuitively, the information repository, I, represents the conditions that are true at each given state s 2 S, where a state represents a page of text (or hypermedia). Furthermore, the reachability relation, R, represents the possible trails a user can traverse during the process of navigating through the Hypertext database.

Denition 2.6 (Trail) A trail in R is a nite sequence of states (i.e. a string of states) corresponding to a path in R. We let #T denote the number of states of a trail T in R (we call #T the count of T) and let Tj ], j 2 f1,: : :,#Tg, denote the j th state of T if j < 1 or j > #T, then Tj ] is undened.

Denition 2.7 (Trail query) A trail query (or simply a query)  is an expression of the form c1 ^ : : : ^ cn n 2 ! where ci  i 2 f1 : : :  ng, is a condition.

The semantics of a trail query  with respect to a Hypertext database H = (H is omitted if it is understood from context) are specied by T j=  i 8i 2 f1 : : :  ng 9j 2 f1 : : :  #T g such that ci 2 I(Tj ]) where T is a trail in R. The Hypertext Query Language (HQL) is dened to be the set of all possible trail queries. In temporal logic EMER90] terminology each condition ci in a trail query  is interpreted as asserting that \sometimes" ci is true. We note that HQL has been simplied for the purpose of this paper the full version of HQL, as dened in LEVE94], also supports supports the temporal operators \nexttime" and \naltime" and general Boolean queries over individual pages. 4

Denition 2.8 (Model checking) A Hypertext database H = is a model of a trail query, , if 9T in R such that T j=  with respect to H. The model checking problem is the problem of deciding whether a Hypertext database is a model of a trail query.

Even though HQL, as dened above, is only a restricted subset of the full query language dened in LEVE94], the following result concerning the intractability of its model checking was shown in LEVE94].

Theorem 2.1 The model checking problem for HQL trail queries is NP-complete. 2

3 Hypertext Databases and Finite Automata We now give the necessary background from LEVE94] concerning Hypertext Finite Automata (HFA) and investigate the subclass of regular languages accepted by HFA. We assume that the reader is familiar with the basic theory of nite automata HOPC79, PERR90].

Denition 3.1 (Star-free languages) A regular language is star-free if it can be generated

from a nite set of strings by repeated applications of the Boolean operations together with concatenation. A comprehensive treatment of several equivalent denitions of star-free regular languages can be found in MCNA71]. The following denition is needed in order to state an alternative characterisation of star-free languages.

Denition 3.2 (Aperiodic regular languages) A regular language L  + is aperiodic if 9n 2 !(n > 0) such that 8x y z 2 , xynz 2 L i xyn+1z 2 L. The following theorem states the equivalence of star-free and aperiodic regular languages PERR90, Theorem 6.1].

Theorem 3.1 A regular language is star-free i it is aperiodic. 2 Denition 3.3 (Hypertext nite automata) The Hypertext nite automaton (HFA) representing a Hypertext database, H = , is a quintuple of the form MH = (A, Q, , Qi , Qf ) (or simply M whenever H is understood from context), where A = f1 : : :  N g is a nite alphabet, with N =jSj. Q = S = fs1  : : :  sN g is the set of states of the HFA.   Q A Q is a transition relation, where (sj  (sj ) sk ) 2  i (sj  sk ) 2 R, with being a one-to-one and onto mapping from Q to A such that (si ) = i. Qi = Q and Qf = Q are the sets of initial and nal states, respectively. If Q = , then M is called the empty Hypertext nite automaton. The following result was shown in LEVE94]. 5

Theorem 3.2 L(MH ) is a star-free language, where H = is a Hypertext database. 2 Denition 3.4 (The language accepted by a HFA) A path in the HFA, M = (A, Q, , Qi, Qf ) (or simply a path if M is understood from context) is a nonempty sequence f(si  (si ) si+1 )g (i 2 f1 : : :  n ; 1g) of transitions in , with n 2 !. (We also say that there exists a path from

(s1  (s1 ) s2 ) to (sn;1  (sn;1 ) sn ).) If s1 = sn , then the path is a cycle. The string w = (s1 ) (s2 ) : : : (sn ) is called the label of the path, s1 is called its origin and sn is called its end. The length of the path is n. The sequence of states < s1  : : :  sn > is called a trace of w in M (or simply a trace of w if M is understood from context). A path is successful if its origin is in Qi and its end is in Qf . A string w 2 A+ is said to be accepted by M if it is the label of some successful path. The language accepted by M, denoted by L(M), is the set of all strings accepted by M. We note that by Denition 3.3 all paths in a HFA are successful, since Qi = Qf = Q.

Denition 3.5 (Unambiguous nite automata) A nite automaton M is unambiguous STEA85] if 8w 2 L(M) there exists only one trace of w in M. A regular language L is unambiguous if there exists an unambiguous nite automaton that accepts L. The following proposition follows from Denition 3.3, since is a one-to-one and onto mapping from the set of states of a HFA to its alphabet.

Proposition 3.3 If M is a HFA, then M is unambiguous. 2 We next characterise HFA in terms of a subclass of regular sets.

Denition 3.6 (Closure under substrings and join) A regular language L is closed under substrings if whenever w 2 L and y is a nonempty substring of w, then y 2 L. A regular language L is closed under join if whenever xy yz 2 L and y = 6 e, then xyz 2 L. Theorem 3.4 A regular language L is closed under substrings and join i L = L(M) for some HFA M. Proof. The if part of the proof follows directly from Denitions 3.3 and 3.4. For the only if part of the proof assume that L is a regular language that is closed under substrings and join. We need to show that L = L(M) for some HFA M. Now, if L = , then M is the empty Hypertext nite automaton for any Hypertext database H = with jSj = 0. So, we assume that L 6= . Let  = fa1  : : :  aN g, N  1, with  = fai j ai 2 L and jjai jj = 1g  6= , since L 6= . In addition, let be a one-to-one and onto mapping from  to f1 : : :  N g such that (ai ) = i. We construct M = (A, Q, , Qi , Qf ) as follows: A = f1 : : :  N g.

Q = fs1  : : :  sN g.   Q A Q, where (sj  (sj ) sk ) 2  i aj ak 2 L, with aj ak = ;1( (sj )) ;1( (sk )) 2 L. Qi = Qf = Q. The result now follows from Denitions 3.3 and 3.4. 6

2

Denition 3.7 (HFA regular languages) A regular language L is said to be a HFA language if L = L(M) for some HFA M. We observe that a HFA language L(M) can be viewed as the set of all navigation paths, through the Hypertext database represented by M, resulting from the composition of two or more paths. Proposition 3.5 HFA languages are closed under intersection but they are not closed under union, complement or concatenation. Proof. The fact that HFA languages are closed under intersection follows from Theorem 3.4. Let L1 = fa, b, abg and L2 = fb, c, bcg. It follows that HFA languages are not closed under union, since abc 62 L1 L2 . In addition, HFA languages are not closed under complement, since abc 2 L1 but ab 62 L1 . Finally, let L3 = fc, d, cdg. It follows that HFA languages are not closed under concatenation, since abc 62 L1 L3 . 2

Denition 3.8 (A closure operator) A closure operator C is dened on + as follows: whenever L  + , C (L) is the closure of L under substrings and join. We say that a language J generates a language L, if J is the minimal cardinality language such that J  L and C (J ) = L. It now follows from Theorem 3.4 that C (L) = L i L is a HFA regular language and thus the set of all closed subsets of + is exactly the class of HFA regular languages. It can easily be veried that indeed C is a closure operator in the sense of DAVE90] and thus the set of all closed subsets of + is a complete lattice when ordered by set inclusion. The greatest lower bound of a set of HFA languages corresponds to their intersection and the least upper bound of a set of HFA languages corresponds to the closure of their union DAVE90]. Corollary 3.6 J generates a HFA regular language L i J = (! ;fa j ab 2 or ba 2 g), where ! = fa j a 2 L and jjajj = 1g and = fab j a, b 2 ! and ab 2 Lg 2 An immediate result of the above corollary is that if J generates a HFA regular language over , then jJ j jj + jj2 . This is to be expected, since the corresponding HFA has at most jj states and at most jj2 transitions. The following additional corollary is an immediate consequence of Theorem 3.2 and Proposition 3.3 noting that if, for example,  = fag and L = faag, then L is both a star-free and unambiguous regular language but L is not a HFA regular language. Corollary 3.7 The class of HFA regular languages is properly included in the intersection of the class of star-free regular languages and the class of unambiguous regular languages. 2

4 Hypertext Probabilistic Automata Herein we demonstrate how probabilistic automata RABI63, PAZ71] can be used to reduce the search space of a trail query with respect to a Hypertext database. We show that if the reachability relation of a Hypertext database is 2total then the language accepted by the HFA representing the Hypertext database is nite, thus reducing the search space of the model checking problem to a nite set. The accepted nite language is a subset of k for some k  1. In this case the solution can be computed eciently when k is small. We assume that the reader is familiar with the basic theory of probabilistic automata RABI63, PAZ71]. (The fundamental theory of nite homogeneous Markov chains (or simply Markov chains) can be found in KEME60, FELL68].) 7

Denition 4.1 (Hypertext probabilistic automata) A Hypertext Probabilistic Automaton (HPA) with cut-point  2 0,1) representing a Hypertext database, H = , is a sextuple of the form PH = (A, Q, M, , Qi, Qf ) (or simply P  whenever H is understood from context), where A = f1 : : :  N g, is a nite alphabet, with N =jSj. Q = S = fs1  : : :  sN g is the set of states of the HPA. M = fp(si sj )Pg(si sj 2 Q) is a stochastic N N matrix, that is 8si sj 2 Q, p(si sj )  0 and 8si 2 Q, Nj=1 p(si  sj ) = 1, where p is a function from S S to 0,1]. In addition, M satises the constraint that p(si  sj ) > 0 i (si  sj ) 2 R. The matrix M is called the transition matrix of P  and the probabilities p(si  sj )(si  sj 2 Q) are called the transition probabilities of P  . =P fp(si)g(si 2 Q) is an N ;dimensional stochastic row vector, that is 8si 2 Q, p(si)  0 and Ni=1 p(si) = 1. In addition, satises the constraint that p(si ) > 0 i 9sj 2 Q such that either (si  sj ) 2 R or (sj  si ) 2 R. The row vector is called the initial distribution of P  and the probabilities p(si )(si 2 Q) are called the initial probabilities of P  . Qi = Q and Qf = Q are the sets of initial and nal states, respectively. If Q = , then P  is called the empty Hypertext probabilistic automaton. We note that the initial distribution of a probabilistic automaton, as dened in RABI63], is degenerate, i.e. it has a single state si 2 S with p(si ) = 1. By NASU68, Theorem 4.9] the class of probabilistic automata remains unchanged with this restriction. Thus we can use the results in RABI63] on assuming that the initial distribution may not be degenerate. Our interpretation of an initial probability p(si ) is the probability that a user will start his/her navigation by inspecting the page associated with si. Correspondingly, our interpretation of a transition probability p(si  sj ) is the probability that a user who is browsing the page associated with si will next browse the page associated with the adjacent state sj . We will assume that the probabilities of the transition matrix and the initial distribution are determined from statistical information collected from past usage of the Hypertext database. Thus we make the (perhaps controversial) assumption that the probabilistic automaton is a Markov chain, implying that when a user is navigating through a Hypertext database his/her next navigation step is solely determined by their current position. Informally, the probability of a string w is the probability of its trace according to the underlying Markov chain. The language accepted by a HPA with cut-point  is the set of strings whose probability is greater than .

Denition 4.2 (The language accepted by a HPA) Given a HPA, P  , the probability of a string w = (s1 ) (s2 ) : : : (sn ) 2 A+ is denoted by P (w) and is given by P (w) = p(s1 )p(s1  s2 ) : : : p(sn;1  sn): The language accepted by P  , denoted by L(P  ), is given by

fw j w 2 A+ and P (w) > g: We note that P (e) is undened thus we assume without loss of generality that P (e) = 0. The next lemma investigates the special case of a HPA, whose underlying reachability relation is 2total. 8

Lemma 4.1 Let P  be a HPA representing a Hypertext database H = , where R is a 2total reachability relation. Then L(P  ) is an unambiguous and star-free regular language. Proof. To begin with, L(M) = L(P 0 ) is unambiguous and star-free by Proposition 3.3 and Theorem 3.2, respectively, where M is the HFA representing H. In order to conclude the proof we show that 8 2 (0,1), jL(P  )j2 !, i.e. jL(P  )j is nite. This

is sucient, since all nite languages are star-free and unambiguous regular languages. Let w 2 L(M). By Proposition 3.3 there is a unique trace < s1  : : :  sn > of w. Furthermore, P (w) = p(s1 )p(s1  s2 ) : : : p(sn;1  sn). Now, let be the maximal probability p(si ), where si 2 Q, and let be the maximal probability p(si  sj ), where si sj 2 Q. Then 8x 2 L(M), P (x)  k , where k =jjxjj. Furthermore, since R is 2total, it follows that 8x y 2 L(M) such that x is a substring of y, P (x) < P (y). It follows that  k tends to zero as k tends to innity and therefore P (x) tends to zero as jjxjj tends to innity. Next, 8k 2 ! there are only a nite number of strings of length k. It follows that 9k 2 ! such that 8x 2 L(M) such that jjxjj  k, P (x) < . The result that jL(P  )j 2 ! follows. 2 Note that, in general, L(P  ) is not a HFA regular language. For example, suppose that  = 0.33, ab bc 2 L(P  ) and P (ab) = P (bc) = 0:5. Then L(P  ) is not closed under join, since P (abc) 0:25. On the other hand, it can easily be shown that L(P  ) is closed under substrings. The next theorem gives an upper bound on the cardinality of L(P  ) which was shown to be nite in the proof of Lemma 4.1.

Theorem 4.2 Let be the maximal probability p(si), where si 2 Q, and let be the maximal probability p(si  sj ), where si  sj 2 Q. If the reachability relation is 2total and  > 0, then jL(P  )j

N k for some k 2 ! that satises  k < : 2 In this case L(P  )  k for some k  1. Thus in this special case model checking can be computed eciently if k is small. The user can always increase the cut-point  in order to decrease k. Moreover, if the HPA representing the Hypertext database is assumed to be xed, then model checking can be carried out in polynomial time of order k, provided the conditions of Theorem 4.2 are satised. Obviously, if  = 0 then the model checking problem is as hard as in the deterministic case, i.e. it is NP-complete LEVE94]. The following theorem generalises Lemma 4.1 to total reachability relations which may not be 2total.

Theorem 4.3 Let P  be a HPA representing a Hypertext database H = . Then L(P  ) is a star-free and unambiguous regular language.

Proof. This proof utilises the previous theorem and looks at the equivalence classes x/E, of the equivalence relation E, such that xEy, where x y 2 L(M) i P (x) = P (y). It follows that L(P  ) is the union of all the equivalence classes x/E such that P (x) > . We claim that each equivalence class x/E is a star-free and unambiguous regular language. If jx/Ej is nite, then the result follows, so assume that jx/Ej is innite. Now, 8y 2 x/E, y is of the form wz k , with k 2 ! and k  0, where either w 6= e or z 6= e and all the symbols in both w and z are distinct. In addition, if z 6= e, then it must be the case that P (z ) = 1 and also that w wz wz k+1 2 x/E. It therefore follows that there are only a nite number of strings wz 2 x/E, since all the symbols in both w and z are distinct. Moreover, each language fy j y 2 x/E, y = wz k with k 2 ! and k  0g, which we denote by L(y), is a regular language, since a nite automaton accepting this language can easily be constructed. Therefore, x/E is a regular language, since it is the union of the nite set of languages,

9

L(y), and regular languages are closed under union. Furthermore, it can be veried that x/E is a star-free regular language, since star-free regular languages are closed under union and Theorem 3.1 implies that each L(y) is a star-free regular language. Moreover, it can also be veried that each regular language L(y) is unambiguous and that these languages are pairwise disjoint. Therefore, x/E is also an unambiguous regular language, since unambiguous regular languages are closed under union whenever the languages unioned are pairwise disjoint. Finally, by Theorem 4.2 we can deduce that there are only a nite number of equivalence classes x/E such that P (x) > . Therefore, L(P  ) is a star-free and unambiguous regular language, since as mentioned above star-free regular languages are closed under union and, in addition, unambiguous regular languages are closed under union whenever the languages unioned are pairwise disjoint. 2 When the reachability relation is not 2total Theorem 4.3 does not guarantee that the cardinality of L(P  ) is nite. Still, by inspecting the proof of Theorem 4.3, we can see that the strings in L(P  ) all have the simple form wzk , with k  1 and P (z) = 1. Thus if k > 1, then wzk is the label of a cycle, which is easily detected as follows. Let fsi1  : : : sim g be the set of states in S having degree one. Furthermore, let fsj1  : : : sjm g be the set of states in S such that 8q 2 f1 : : :  mg sjq is adjacent to siq . Now, all we need to do is to test whether there is a cycle in R from sjq to siq such that the degree of each state in this cycle is one. This test is successful i there exists a string z k 2 L(P  ) which is the label of such a cycle. Thus by letting to be the maximal probability p(si ) < 1, where si 2 Q, and to be the maximal probability p(si sj ) < 1, where si sj 2 Q, we can utilise Theorem 4.2 to obtain a nite approximation of the cardinality of L(P  ), which ignores these cycles when they do exist. We next consider the concept of an isolated cut-point RABI63, PAZ71], which corresponds to the concept of bounded error probability GILL77].

Denition 4.3 (HPA with isolated cut-point) The cut-point  of a HPA P  is isolated (or simply  is an isolated cut-point whenever P  is understood from context) if 9" > 0 such that 8x 2 +, jP (x) ; j  ", where jrj is the absolute value of a real number, r (cf. RABI63]). The concept of isolated cut-point has been of fundamental importance to the theory of probabilistic automata, since it was shown that if a probabilistic automaton has an isolated cut-point, then the language accepted by the probabilistic automaton is regular RABI63, PAZ71]. On the other hand, a language accepted by a probabilistic automaton can be regular but its cut-point need not be isolated RABI63, PAZ71]. The following result shows that if  > 0, then we can nd a HPA P  , whose cut-point  is isolated, that is equivalent to P  in the sense that the regular languages that P  and P  accept are the same. It can be veried that if (L = L(P 0 ))  +, then 9x 2 + such that P (x) = 0, and thus  = 0 is not an isolated cut-point unless L = +.

Theorem 4.4 If  > 0, then there exists an isolated cut-point    such that L(P  ) = L(P  ). Proof. If the cut-point  is already isolated, then take  = . So, assume that  is not an isolated cut-point and thus 9x 2 + such that P (x) = . Let  be the minimal probability of any string in y 2 L(P  ), i.e. 8y 2 L(P  ),  <  P (y) by the proof of Theorem 4.3  exists, since there are only a nite number of equivalence classes y/E such that P (y) > . Now, by the density of the real numbers we can constrain  to satisfy  <  <  . The result follows by the denition of an isolated cut-point, since " can be chosen as the minimum of  ;  and  ; . 2

10

5 Probabilistic Hypertext Query Language Using the approach presented in the previous section we extend HQL to be probabilistic and call the resulting query language PHQL. In PHQL a trail, T, probabilistically logically implies a query with cut-point  i T logically implies the query and P (T) > . Our approach is similar to that taken in LEHM82] wherein models of logical formulae are Markov chains, and a sequence of states (i.e. a trail) satises a logical formula if the formula is true in the model \with probability one". Our approach di ers from that in LEHM82] in that logical PHQL formulae (i.e. trail queries) are satised with probability greater than a cut-point  2 0,1). This is also in contrast to the approach taken in FAGI90] which allows explicit reasoning about probabilities in logical formulae.

Denition 5.1 (Probabilistic logical implication) Let H = be a Hypertext database,

and T be a trail in R. Then T probabilistically logically implies a trail query, , with respect to H and cut-point  2 0,1) (or simply T logically implies , if H and  are understood from context), written T j  (or simply T j  whenever  is understood from context), if T j=  and P (T) > :

Denition 5.2 (Probabilistic trail query) A probabilistic trail query (or simply a probabilistic query or a query whenever no ambiguity arises) is a trail query, , viewed as a mapping from Hypertext databases to sets of trails such that its inputs are a Hypertext database H = and a cut-point  2 0,1), and its output (H, ) is dened by

fT j T is a trail in R and T j g: A trail T in this set is called an answer of (H, ). It follows that j(H, 1 )j j(H, 2 )j, whenever 2 1 . Thus, j(H, )j is maximal when  = 0. When  = 0 we abbreviate (H, 0) to (H).

Denition 5.3 (Probabilistic model checking) A Hypertext database H = is a probabilistic model of a probabilistic trail query, , with cut-point  2 0,1), if 9T in R such that T is

an answer of (H, ). The probabilistic model checking problem is the problem of deciding whether a Hypertext database is a probabilistic model of a probabilistic trail query with a given cut-point. The following proposition shows that in the general case the probabilistic model checking problem is as hard as model checking.

Proposition 5.1 The probabilistic model checking problem is NP-complete. Proof. We rstly show that T j  is in NP. In LEVE94] we have shown that T j=  is in NP.

The result follows, since the condition P (T) >  can easily be checked in polynomial time in #T and the sizes of the initial distribution, , and the transition matrix, M, of P  . Secondly, T j  is NP-hard, since T j=  reduces to T j0 . then 2 Although in general probabilistic model checking is intractable, in the case when R is 2total and  > 0 we can utilise the result of Theorem 4.2, implying that j(H, )j jL(P  )j N k , where k 2 ! satises  k < . If R is not 2total, then we can apply the comment made after Theorem 4.3. As a heuristic users can tune the value of  in order that the search space of trails be manageable, or alternatively specify k so that only trails whose count is at most k are returned. 11

6 Hypertext Random Automata Hereafter we investigate the use of random Turing machines LEEU56, RABI76, GILL77, JOHN90] with the aim of reducing the search time of nding a trail that logically implies a given trail query. Such use of random Turing machines is benecial, as long as the probability that the returned trail indeed logically implies the trail query is high normally high means that the probability is greater than one half. This approach is in contrast to the one taken in Section 4, since in this case we will be doing a totally random search without any use of precompiled navigation statistics. Let H = be a Hypertext database and  be a trail query. By a result in LEVE94], given a trail T in R we can check whether T j=  in polynomial time in the sizes of T,  and I. Thus when using a random Turing machine in order to solve the model checking problem, if H is not a model of  (i.e. for all T in R, T 6j= ), then the random Turing machine will always answer correctly with a \no". On the other hand, if H is a model of , then we are interested in nding the probability that the random Turing machine will answer correctly with a \yes". This probability is determined by the proportion of trails in H that are models of . In particular, when the probability is greater than one half then this technique is viable. We now dene the notions of a random Hypertext automaton and the languages that are accepted by such an automaton. In the next section we will calculate a lower bound on the probability that such an automaton will randomly nd a trail that logically implies a trail query with respect to a Hypertext database.

Denition 6.1 (Hypertext random automata) The Hypertext Random Automaton (HRA) representing a Hypertext database, H = , is a HPA, PH0 , denoted by RH (or simply R if H

is understood from context), such that both the nonzero initial probabilities p(si ) and the nonzero transition probabilities p(si  sj ) are uniformly distributed. A HRA R with success probability  2 (0,1) is denoted by R . The success probability attached to a HRA represents the probability (or informally the condence) that the user would like to have in the sense that a particular trail chosen at random logically implies the trail query the user is interested in.

Denition 6.2 (Languages accepted by HRA) A regular language L over A is accepted by a HRA, R, if either L = or 9k 2 ! which satises the following constraints: 1. 8i 2 f1 : : :  ng P (wi ) > 0 (i.e. wi 2 L(RH )), and P 2. ni=1 P (wi )  , where L \ Ak = fw1  : : :  wn g.

The minimal natural number, k, which satises the above two constraints is called the depth of L with respect to R (or simply the depth of L whenever R is understood from context).

We observe that the depth of any language L 6= accepted by R is greater than or equal to one, since by the denition of a language, e 62 L. The next proposition follows from Denition 6.2 on recalling that by Denition 6.1 the initial and transition probabilities are uniformly distributed.

Proposition 6.1 If L is accepted by R, then the inequality j L \ Ak j   j L(RH ) \ Ak j obtains.

2 12

A nonempty regular language L accepted by a HRA R contains at least one answer to a trail query, say , that the user is interested in. Obviously, given w 2 A+ we can check in polynomial time in the sizes of w,  and I whether w 2 L, since L is regular. Thus it is always easy to verify whether a particular trail logically implies the trail query  or not. Next, let R be a HRA with state set Q = fs1  : : :  sN g. In addition, let 0 denote the number of initial probabilities p(si ) such that p(si ) > 0 and let i denote the number of transition probabilities p(si  sj ) such that sj is adjacent to si . Finally, let adjacent(si ) denote any sequence < s1i  : : :  si i >, where fs1i  : : :  si i g is the set of all the states adjacent to si. We assume that a function random(n), where n 2 !, is available and that it returns a natural number uniformly distributed between 1 and n random(n) corresponds to rolling an unbiased n-sided dice. Now, suppose that L is accepted by R and that the depth of L is k. Then the following algorithm, designated random trail(R k), is a linear random Turing machine which returns a trail, T, of count k such that by Proposition 6.1 the probability that T 2 L is greater than or equal to . Ultimately we accept the random trail, T, if T 2 L, otherwise we reject T.

Algorithm 1 (random trail(R k)) 1. begin

2. i := random(0 ) 3. T1] := i 4. cur state := ;1 (i) 5. for j = 2 to k do 6. seq := adjacent(cur state) 7. i := random(i) 8. cur state := the ith element of seq 9. i := (cur state) 10. Tj ] := i 11. end for 12. #T = k 13. return T 14. end. A HRA R can also be viewed as inducing a binomial distribution FELL68] with success probability . If repeated independent trials are allowed, then we may ask ourselves how many trials are necessary in order to have condence   < 1 that at least one trial results in a success. That is, we must nd m 2 ! satisfying the inequality

  1 ; (1 ; )m :

7 Query Answering with Respect to Hypertext Random Automata We now utilise the results of the previous section in order to discuss possible improvements to the time complexity of model checking. Let H = be a Hypertext database,  be a trail query and let R be the HRA representing H. Furthermore, let L = (H) be the language induced by the output of the trail query  with respect to H. Assume the user has acquired the knowledge that L is accepted by R and that the depth of L with respect to R is k. In this case, the user can invoke Algorithm 1, and assign T to random trail(R k). If T j= , then H is a model of , with T being an answer to (H), otherwise 13

the conclusion that H is not a model of  can be deduced with error probability of (1 ; ). As mentioned at the end of the previous section we can reduce this probability to (1 ;)m if we invoke random trail(R k) repetitively m times and return an answer to (H) if one out of the m invocations is successful. The rest of this section is concerned with nding a lower bound on the probability of success, , when a given Hypertext database is a model of a given trail query. The following assumptions are made: S is the set of states of R and N  1, with jSj= N . H = is a Hypertext database and R is a HRA representing H. k  1 is the depth of the regular languages accepted by R with success probability  2 (0,1).  is the trail query c1 ^ : : : ^ cn, n  1, and we assume without loss of generality that  = fc1  : : :  cn g. T is assigned to random trail(R k).

Theorem 7.1 Assume that R is complete and that 8q 2 f1 : : :  ng there exists exactly one state sj  j 2 f1 : : :  N g, such that cq 2 I(sj ). Then a lower bound on the number of trails that logically imply  is given by

j (H ) jm

 !  !  ! m m = N k ; 1 (N ; 1)k + 2 (N ; 2)k ; m3 (N ; 3)k + : : : + (;1)m (N ; m)k  ! m X i = (;1) mi (N ; i)k  (1) i=0

where m  1 is the number of states sj such that I(sj ) 6= . In addition, m n, m N and m k. Proof. Firstly, m n, since 8q 2 f1 : : :  ng there exists exactly one state sj  j 2 f1 : : :  N g, such that cq 2 I(sj ). Secondly, m N , since jSj= N . Thirdly, if k < m, then all of the trails in N k do not logically imply . Thus the maximal value of m is the minimum of the values n N and k. The term N k gives in R whose count is k, assuming that R is ;mthe totalk number of trails k complete. The term 1 (N ; 1) = m(N ; 1) gives the number of trails that do not logically imply; , which do not contain at least one of the states sj such that I(sj ) 6= . In general, the term mi (N ; i)k gives the number of trails that do not logically imply , which do not contain at least i of the states sj such that I(sj ) 6= . The result now follows from the principle of inclusion and exclusion GRIM89]. 2

Denition 7.1 (The minimal number of trails that logically imply ) The minimal value of j (H ) jm is denoted by min(j (H ) jm ). We conjecture that the value min(j (H ) jm ) is obtained when m is maximal, that is when m is the minimum of the values n N and k. Using this conjecture we obtain the next corollary.

Corollary 7.2 A lower bound on the probability of success, , of a HRA, R, denoted by P (), on assuming that the reachability relation is complete, is given by

m P () = min(j N(kH ) j ) : 2

14

We will now assume that the reachability relation R of the Hypertext database H is -regular,  2 f1 : : :  N g. When  = N, then R is complete. The following denition is used in Theorem 7.3.

Denition 7.2 (Proper subtraction) The proper subtraction of two natural numbers, a and b, denoted by a  b, is given by ( b a  b = a0 ; b ifif aa 

Suggest Documents