The Laborious Way From Data Mining to Web Log Mining

Myra Spiliopoulou
Institut für Wirtschaftsinformatik, Humboldt-Universität zu Berlin
Spandauer Str. 1, D-10178 Berlin, Germany
Phone: +49.30.20935730, Fax: +49.30.20935741
Email: [email protected]

Abstract

The web is a large source of information that can be turned into knowledge. Part of this knowledge concerns the usage of the web itself and is invaluable for organizing web sites so that they meet their purposes and prevent disorientation. Data mining is already being used to shed light on various aspects of web utilization. For one major aspect, the discovery of navigation patterns, we show that a new mining model is necessary. We formalize the notion of navigation pattern, introduce a model for navigation pattern discovery by extending the classical model of association rule discovery, and establish the processing framework of this model. Conventional tools for association rule discovery and for sequence analysis cannot deal with this discovery problem. However, we show how they can be used as preprocessors to reduce the search space before the actual mining phase.

1 Introduction

As the information offered in the Web grows daily, obtaining information from it becomes more and more tedious. Attempts to tame the web into human dimensions aim mostly at discovering the structure hidden behind semistructured pages and exploiting it to discover semantics. The main difficulty lies in the unstructured web content, which is not easily amenable to external regulations enforcing standards and structure. At the same time, the Web has a rigid network structure, and its main information acquisition mechanism is still navigation. In this network of pages connected to each other for obvious and less obvious reasons, providers may fail to ensure that important offers are indeed reached by interested users, while users are subject to disorientation due to the lost-in-hyperspace syndrome.

To deal with this problem, two families of tools emerge. The first family includes tools that accompany the user in her navigation, learn from her behaviour, make suggestions to her on the way and occasionally customize the way for her. Alexa (http://www.alexa.com) is one of the sophisticated members of this family. The second family of tools analyzes the activities of users off-line. Their goal is to provide insights into the semantics of a web site's structure by discovering how this structure is actually utilized. The long-term goal is the customization of the web site based on knowledge from past usage, so that this family can provide the intelligence used by the first family of tools.

Most of the software currently available for this task performs some basic analysis of web log files, including statistical results on traffic load and access to pages or small page sequences. A detailed discussion can be found in [ZXH98], which concludes that such software is inappropriate for a thorough analysis of access patterns. Data mining is the methodology of choice to perform this type of analysis.
As this new field emerges, issues of discovery and analysis of web access patterns are discussed in [CPY96, CMS97b, PE98, SF98, ZXH98, Wex96]. The goal is common: discovery of knowledge on the navigational behaviour of users to predict future trends. However, the viewpoints are quite different.

Related work. The miner proposed in one of the earliest works in this area [CPY96] discovers statistically dominant paths using a methodology for the discovery of association rules. The "Footprints" tool of [Wex96] records the footprints left behind by web site visitors and accumulates them into frequently accessed paths. The "PageGather" tool of [PE98] uses a clustering methodology to discover web pages visited together and to place them in the same group. The tool proposed in [PPR96] also groups semantically relevant web pages together. It considers text similarity, topological proximity of web pages and frequency of access on individual links. It uses a technique from the field of information retrieval, called "spreading of activation", which has similarities to measuring flow in pipelines. Although good results can be achieved for the grouping of individual pairs of source page and target page, there is no straightforward way of generalizing this approach to longer paths.

The "WEBMINER" of [CMS97a] provides a query language, with which the user can initiate searches for paths conforming to more sophisticated criteria than high frequency of access. However, the miners invoked to process such queries have not been designed to cope with such criteria: according to [CMS97a], the miner for association rules and the miner for sequential access patterns incorporated into the WEBMINER are conventional tools, of which the former is slightly customized to improve its performance. The disadvantage of this approach is the necessity of a postprocessing module, which computes the information required to verify those criteria of the language that are not supported by the miner itself. The "web log miner" of [ZXH98] exploits OLAP technology for prediction, classification and time-series analysis of web log data. The authors obtain interesting results on web traffic analysis and on the evolution of user behaviour (e.g. preferred pages) over time. However, the orthogonal issue of assessing the users' behaviour to detect and prevent disorientation by site redesign is left open.
In these studies, the mining methodology is borrowed from a known model (clustering, association rules, sequence discovery) and adjusted by pre- and postprocessing activities, if necessary. The generic problem of pattern discovery is addressed in [SA96, MT96, Wan97]: all frequent patterns of size 1 are discovered and stepwise extended to patterns of larger sizes; patterns that do not satisfy the frequency threshold are pruned at each step. Depending on whether the pattern should be a sequence or a more generic structure, the size being increased at each step can be the length or a more sophisticated measure. A problem with those approaches is the dominant role assigned to the frequency of accesses to a page. Intuitively, the next page visited by a user depends on some of the pages accessed before. To better illustrate the impact of this problem, we discuss a small example.

Example 1: The home page H of a fictitious institution has links with the anchors Location and, later, Be our guest!. Anchor Be our guest! points to the postal address, while Location points to a tour through the institution's premises. Visitors looking for the postal address may take the wrong turn and navigate a while through the buildings across different routes, before they go back to H and select the next link. Let us assume that H was accessed for the first time by 1000 visitors, 800 of whom looked for and found the postal address after taking the wrong turn. Tools assuming that a page revisit indicates the beginning of a new session [CPY96, PPR96] will miss this navigation pattern. Tools clustering relevant pages together [PE98] may or may not find the home page and the postal address to be semantically relevant, but this is not very interesting in itself. An association rule $H \rightarrow \textit{Be our guest!}$ is of little interest, too. It also has a very low confidence (800/1800), because H is visited 1800 times in total.

A miner discovering sequences with a confidence of at least 80% will find the sequence from the second access at H to the postal address, but there is nothing special about this sequence. The really strange sequence from the first access at H to the postal address will be considered frequent only if all 800 visitors followed the same route after taking the wrong turn and decided to go back to H from the very same node in this route. What we need here is a miner that (a) distinguishes between the first and second visit to H and (b) skips the nodes belonging to the tour through the buildings. In fact, we need a miner that can discover a generalized form of sequence $(H, 1) \star (\textit{Be our guest!}, 1)$, where $(H, 1)$ denotes the first visit to H. In this paper, we propose such a miner.
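The arithmetic of Example 1 can be replayed on a small synthetic log. The sketch below is only an illustration: the intermediate page name `Tour stop` and the behaviour of the remaining 200 visitors (`Elsewhere`) are assumptions that the example does not specify.

```python
from fractions import Fraction

# 800 visitors take the tour by mistake, return to H, then reach the address.
wrong_turn = ["H", "Location", "Tour stop", "H", "Be our guest!"]
# Hypothetical: the other 200 visitors leave H for some unrelated page.
other = ["H", "Elsewhere"]
log = [wrong_turn] * 800 + [other] * 200

# Page-level view: every access to H counts, so H is visited 1800 times.
h_accesses = sum(session.count("H") for session in log)
h_then_guest = sum(
    1
    for session in log
    for page, nxt in zip(session, session[1:])
    if page == "H" and nxt == "Be our guest!"
)
rule_confidence = Fraction(h_then_guest, h_accesses)

# Occurrence-aware view: first visit to H, anything in between, then the address.
first_h = sum(1 for session in log if session[0] == "H")
first_h_then_guest = sum(
    1 for session in log
    if session[0] == "H" and "Be our guest!" in session[1:]
)
pattern_confidence = Fraction(first_h_then_guest, first_h)

print(rule_confidence, pattern_confidence)
```

Under these assumptions the page-level rule reaches 800/1800, while the pattern anchored at the first visit to H reaches 800/1000, which is the gap the example points at.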

Our work. In this study, we first focus on the theory behind the navigation patterns of users in the web. Our aim is to model navigation patterns before attempting to discover any of them. As discussed above, it is necessary to formalize known facts about navigation, such as that not all pages across a path are of equal importance and that users tend to revisit pages previously accessed [TG97]. Our formal model is based on the notion of "generalized sequence" and of a "template" of variables. After providing a formal model for navigation patterns, we turn to the question of how to discover them using mining techniques. We propose a miner that alleviates the shortcomings of the techniques mentioned above. This miner discovers the generalized sequences that conform to a template of variables and satisfy statistical thresholds. It operates on a tree structure, into which the original log file has been aggregated. This work is part of our Web Utilization Miner project WUM, in which a mining language [SF98] and a visualization interface are also being developed (http://www.wiwi.hu-berlin.de/myra/WUM/).

Navigational hints. This study is organized as follows: In the next section, we provide the basic definitions of our formalism by modelling a navigation pattern as a "generalized sequence", introducing statistical properties for it and defining "templates" of variables over generalized sequences. The g-sequence discovery process is applied on an aggregated version of the log file. The aggregation is described in section 3. Section 4 describes the miner itself. In section 5, we compare our new mining paradigm with two existing paradigms appropriate for the discovery of navigation patterns in web logs: association rule discovery and sequence discovery. We show their shortcomings in the discovery process, but also describe useful ways of exploiting them for pruning uninteresting data. The last section concludes our study. There are two routes for reading this paper. The sequential reading is described above. An alternative reading would cover sections 2 and 5 before returning to the description of the log file aggregation and the presentation of our miner. To support this reading, we have kept section 5 fairly self-contained.

2 A Generalized Notion of Sequence

A navigation pattern in the web can be conceived as a sequence of web page accesses. However, not all members of this sequence are of equal importance. An indicator of this is the fact that the most frequently accessed pages in a site are those serving as link collections. As noted in the introduction, a missing link may hide the relevance between two pages or lend importance to all pages in-between. In the mining context, we are summarizing the behaviour of many users moving in the web. So, we must formally describe navigation patterns in such a way that summarization on parts of the traversed sequences is performed, while the traffic in other parts is simply ignored. To achieve this task, we define a navigation pattern in the web as a generalized notion of sequence, the materialization of which is a directed acyclic graph. This generalized sequence, a g-sequence, may contain wildcards that match any element or subsequence.

2.1 Conventional Sequences

The definitions of this section are close to the models describing the matching of strings with regular expressions. However, our aim is rather different. Instead of looking for strings matching a given regular expression, we must find all sequences that match an expression containing wildcards and variables and being subject to restrictions on the number of matches found. We model this as a problem of rule discovery, adjusting and extending definitions from that domain as necessary.


2.1.1 A Web Log File and its Sequences

For our definitions, we use a set $U$ of elements and a log file $L$ with data entries $s \in U^*$, where $*$ is the Kleene star operator. We denote the empty sequence as $\varepsilon \in U^*$. Obviously, $\varepsilon \notin L$. In web log mining, $U$ is the set of web pages in the site providing the log being mined. For those pages, we might keep some meta-data useful to specify potentially interesting pages, as described in [SF98]. In this study, those properties are not of interest. The log file $L$ is a multiset of recorded sequences. It is not a simple set, since a sequence may appear more than once. The combination of data entries into sequences is part of the data preparation phase, which is described in detail in [CMS97a, SF98].

Definition 1: Let $U$ be a set of elements. A "sequence" $s$ is a vector in $U^*$. The function $\mathit{length}(s)$ returns the length of $s$. The function $\mathit{prefix}(s, i)$ returns the subsequence comprised of the first $i$ elements of $s$. If $s' = \mathit{prefix}(s, i)$, we say that $s'$ is a "prefix of" $s$ and denote this as $s' \preceq s$.

Hereafter, we use array notation when referring to the elements of a sequence, i.e. $s[i]$ denotes the $i$th element of the sequence $s$. When we want to observe $s$ as a concatenation of the subsequences $x, y$, we use the notation $s = x \cdot y$ instead.

One aspect we would like to stress in the context of web access sequences concerns the multiple occurrences of a page in a sequence. Tauscher and Greenberg show in [TG97] that web users tend to move backwards and revisit pages with a high frequency. Such revisits may be part of a guided tour or may indicate disorientation. In any case, their existence is precious information and should be retained. To model cycles in a sequence, we label each element of the sequence with its occurrence number within the sequence, thus distinguishing between the first, second, third etc. occurrence of the same page. This augmentation implies that all pages in the sequences and their generalizations introduced hereafter are distinct.
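The occurrence labelling described above can be sketched in a few lines. This is an illustrative sketch, not code from the paper: the function names `label_occurrences` and `prefix` are chosen here, and pages are assumed to be single-character strings for brevity.

```python
from collections import Counter

def label_occurrences(pages):
    """Pair every page with its occurrence number within the sequence."""
    seen = Counter()
    labeled = []
    for page in pages:
        seen[page] += 1
        labeled.append((page, seen[page]))
    return labeled

def prefix(s, i):
    """The first i elements of s, mirroring prefix(s, i) of Def. 1."""
    return s[:i]

session = label_occurrences("bdbc")
print(session)             # the revisit of b becomes a distinct element
print(prefix(session, 2))
```

After labelling, a revisit such as the second `b` is a distinct element `('b', 2)`, so no element occurs twice in a sequence.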

2.1.2 Statistics of Sequences

To introduce statistics in the context of sequences, we adjust the measures proposed for association rule discovery [ATS93]. In the context of association rules, a rule $A \rightarrow B$ has a "confidence" $c$ if the probability of event $B$ given event $A$ exceeds $c$. To avoid trivial rules, it is required that $A \wedge B$ has a minimum "support" $x$, i.e. that $A$ and $B$ appear at least $x$ times in the log file.

Definition 2: Let $s \in U^*$ be a sequence and $L$ be a log file. The "support" of $s$ in $L$, $\mathit{support}(s, L)$, is the number of sequences in $L$ that have the form $s \cdot y$, where $y$ is an arbitrary, possibly empty, subsequence.

In the following, we use the notation $\mathit{support}(s)$, since the log file is constant in our discussion.

Lemma 1: The support of the empty sequence $\varepsilon \in U^*$ is equal to the cardinality of $L$, i.e. $\mathit{support}(\varepsilon) = |L|$, because each sequence $x \in L$ has the form $x = \varepsilon \cdot x$.

Definition 3: Let $s, s' \in U^*$ be two sequences and $L$ be the log file. The "confidence of $s'$ following $s$" is the conditional probability of $s \cdot s'$ given $s$, or, equivalently, the percentage of sequences that contain $s \cdot s'$ among those containing $s$:

$$\mathit{confidence}(s', s) = \frac{\mathit{support}(s \cdot s')}{\mathit{support}(s)}$$

This extended definition is still restrictive. In particular, how can we specify that in a rule $s$ and $s'$ need not be directly concatenated but may be separated by a third subsequence, the contents of which are not of interest? Or that $s$ should have a particular structure, e.g. a minimum or maximum length, or appear at a specific position in the sequences recorded in $L$? In fact, we can define $\mathit{support}(s)$ to be the number of sequences that begin with $s$, but then we cannot enforce $s$ to appear in any other position than the first one. We remove those limitations by two extensions: First, to allow specifications on the structure of the sequences, we generalize the notion of a sequence by permitting wildcards. Second, we allow the specification of statistical and structural restrictions for any subsequence in the context of its predecessors.
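Definitions 2 and 3 translate almost directly into code. The following is a minimal sketch under stated assumptions: the log is a plain Python list used as a multiset, pages are single characters, and the four-entry log is hypothetical data invented for illustration.

```python
from fractions import Fraction

def support(s, log):
    """Number of sequences in the multiset `log` that start with s (Def. 2)."""
    return sum(1 for entry in log if entry[:len(s)] == s)

def confidence(s_next, s, log):
    """Confidence of s_next following s (Def. 3); undefined if support(s) is 0."""
    return Fraction(support(s + s_next, log), support(s, log))

# Hypothetical log: a multiset, so "abc" legitimately appears twice.
log = ["abc", "abd", "abc", "bc"]

print(support("", log))            # Lemma 1: equals the cardinality of the log
print(support("ab", log))
print(confidence("c", "ab", log))
print(support("bc", log))          # only counts the entry that starts with bc
```

The last call shows the restriction discussed above: `bc` also occurs inside `abc`, but a prefix-based support cannot see it.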

2.2 Generalized Sequences

Generalized sequences are sequences of elements that either belong to the set $U$ or are wildcards. In this section, we formally define this type of sequence as well as the notion of template as a vector of variables ranging over generalized sequences.

2.2.1 Sequences with Wildcards

Outside the set of elements $U$, we specify a "wildcard", denoted as $\star[low, high]$ and having the semantics of being matched by any sequence of elements that has length at least $low \geq 0$ and at most $high \geq low$. With a little abuse of notation, $high$ may take the special value $+\infty$, indicating a sequence of arbitrary length. In the following, we denote a wildcard with $\star$ if its range $[low, high]$ is not of interest. For the particular wildcard $\star[0, +\infty]$, we use the symbol $*$. Furthermore, we specify outside the set $U$ a "root" symbol, denoted as $\wedge$ and having the semantics of indicating the beginning of a sequence, prior to the first element.

Definition 4: Let $U$ be a set of elements and let the root $\wedge$ and the wildcard $\star$ be two special symbols so selected that they do not belong to $U$. Let $U^+ = U \cup \{\wedge, \star\}$. A "generalized sequence" or "g-sequence" is a vector $g \in (U^+)^*$ such that: (i) at least one element of $g$ belongs to $U$, (ii) the first element is either $\wedge$ or an element of $U$, (iii) no element of $g$ other than the first can be equal to the root and (iv) no two adjacent elements of $g$ are wildcards.

Lemma 2: A g-sequence $g$ has the form $g_1 \star g_2 \star \ldots \star g_n \star$, where $g_1 \in U \cup \{\wedge\}$ and $g_2, \ldots, g_n \in U$.

This statement holds, because we can insert a dummy wildcard $\star[0, 0]$ between any two consecutive elements $g_i, g_{i+1}$. According to this lemma, we can introduce a notion of length for g-sequences.

Definition 5: The "length" of a g-sequence $g = g_1 \star \ldots \star g_n \star$ is the number of non-wildcard elements in it.

We can now introduce the notion of match for a sequence against a (part of a) g-sequence. Intuitively, an element of a g-sequence that is not a wildcard matches only itself, while a wildcard matches any sequence, the length of which falls within the range specified by the wildcard. More formally:

Definition 6: Let $g$ be a g-sequence and $a = a_1 \star[low_1, high_1]\, a_2 \star[low_2, high_2] \ldots a_n \star[low_n, high_n]$ a subsequence of $g$. Since $a$ is a subsequence, its first segment need not begin with $\wedge$. We say that a sequence $s \in U^*$ "matches" $a$ iff $s = x_1 \cdot x_2 \cdot \ldots \cdot x_n$, where:

For $i = 2, \ldots, n$: $x_i = a_i \cdot y_i$, where $y_i \in U^*$ is an arbitrary sequence with $\mathit{length}(y_i) \in [low_i, high_i]$.

For $i = 1$: $x_1 = y \cdot y_1$, where $\mathit{length}(y_1) \in [low_1, high_1]$. For $y$ it holds that $y = \varepsilon$ if $a_1 = \wedge$; otherwise $y = a_1$.

We say that a subsequence $x_i$ is a "segment" of $s$ that matches the $i$th "segment" of $a$, $a_i \star[low_i, high_i]$.

The first segment $a_1 \star$ is treated in a particular way. If it contains an arbitrary element of $U$, this element may appear anywhere in the sequence $s$. If it contains the root $\wedge$, it is matched by the empty sequence; this implies that the next segment $a_2 \star$ is matched by the beginning of $s$.

Example 2: Let $U$ be a set of web pages, identified by the lowercase letters of the Latin alphabet. Let $L$ be a log file comprised of the following sequences, where the "$\cdot$" symbol of concatenation has been omitted: $abcd \equiv a \cdot b \cdot c \cdot d$; $klmnab$; $abdec$.
1. The g-sequence $a * c$ is matched by $abcd$ and $abdec$, while $a \star[0, 2]\, c$ is matched only by $abcd$.
2. The g-sequence $bd$ is matched by $abdec$. The g-sequence $\wedge * bd$ is also matched by $abdec$ and actually has the same semantics as $bd$.
3. The subsequence $\wedge$ is matched by all sequences in $L$. The subsequence $\wedge \star[5, 10]$ is matched only by $klmnab$ and $abdec$; $abcd$ has only four elements and does not qualify.

The notion of g-sequence introduced in Def. 4 formalizes the intuitive term of navigation pattern as a sequence of accessed web pages, some of which are of no interest. Thus, the discovery of interesting access patterns in a web log is the discovery of g-sequences satisfying some statistical and/or contextual criteria of interestingness.
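The matching rules of Def. 6 can be checked mechanically against Example 2. The sketch below is one possible reading, not the paper's implementation: a g-sequence is assumed to be a Python list whose items are page names, the string `"^"` as an ASCII stand-in for the root, or `(low, high)` tuples for wildcards with `None` meaning no upper bound. A g-sequence that does not start with the root is treated as if it were preceded by the root and an unrestricted wildcard, and a log sequence counts as a match when one of its prefixes matches.

```python
ROOT = "^"        # ASCII stand-in for the root symbol
ANY = (0, None)   # the unrestricted wildcard; None means no upper bound

def normalize(g):
    """Rewrite g as ROOT, gap, element, gap, ..., element, gap (cf. Lemma 2)."""
    g = list(g)
    if g[0] != ROOT:
        g = [ROOT, ANY] + g           # the first element may appear anywhere
    out, need_gap = [ROOT], True
    for item in g[1:]:
        if isinstance(item, tuple):   # explicit wildcard
            out.append(item)
            need_gap = False
        else:
            if need_gap:
                out.append((0, 0))    # dummy wildcard between adjacent elements
            out.append(item)
            need_gap = True
    if need_gap:
        out.append((0, 0))
    return out

def is_matched_by(g, s):
    """True if some prefix of the sequence s matches the g-sequence g."""
    positions = {0}                   # positions in s reachable so far
    for item in normalize(g)[1:]:
        if isinstance(item, tuple):
            low, high = item
            reached = set()
            for p in positions:
                top = len(s) if high is None else min(len(s), p + high)
                reached.update(range(p + low, top + 1))
        else:
            reached = {p + 1 for p in positions if p < len(s) and s[p] == item}
        positions = reached
        if not positions:
            return False
    return True

log = ["abcd", "klmnab", "abdec"]
for g in (["a", ANY, "c"], ["a", (0, 2), "c"], ["b", "d"], [ROOT, (5, 10)]):
    print(g, [s for s in log if is_matched_by(g, s)])
```

Tracking the set of reachable positions avoids backtracking: each item of the g-sequence is processed once, and a wildcard simply widens the set of positions at which the next element may be looked up.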

2.2.2 Statistics for g-Sequences

We now introduce statistical measures for g-sequences. The notion of support is the most important one, because the other measures are computed from it. For this, we generalize the notion of support introduced in Def. 2 as follows.

Definition 7: Let $L$ be the log file over sequences from $U^*$ and let $s \in (U^+)^*$ be a g-sequence. Further, let $s' \in (U^+)^*$, such that $s' \preceq s$ (see Def. 1). The support of $s'$, $\mathit{support}(s')$, is the number of sequences in $L$ that have the form $x \cdot y$, where $x$ matches $s'$ according to Def. 6.

Lemma 3: The support of the empty g-sequence, as well as the support of the root $\wedge$, is equal to the cardinality of $L$: $\mathit{support}(\varepsilon) = \mathit{support}(\wedge) = |L|$.

A prefix $s'$ of a g-sequence may have the form $\wedge \star[low, high]$. So, one may wonder about the support values of the elements in this subsequence. As already noted in Example 2, $\wedge$ is matched by all sequences in $L$, so that $\mathit{support}(\wedge) = |L|$. The prefix $\wedge \star[low, high]$ is matched by all sequences having at least $low$ and at most $high$ elements at their beginning. The support of the prefix is then the number of those sequences.

Example 3: Let $U$ be a set of web pages identified by lowercase letters. Let $L$ be the log file, comprised of the sequences $abcd$, $abdc$, $klmnab$, $bcpdqr$, $abdec$, $abcd$. Note that the first sequence appears twice in $L$. Then $|L| = 6$.
1. The support of $abc$ is 2, as $abc$ appears only in the first and last sequence of $L$. However, $\mathit{support}(ab * c) = 4$: The first and last sequence start with $abc$, the second and fifth one start with $ab$ followed by an arbitrary subsequence, $d$ resp. $de$, after which $c$ appears.
2. $\mathit{support}(cd) = 2$, since the first and last sequence contain $cd$. However, $\mathit{support}(\wedge cd) = 0$, because no sequence starts with $cd$.
3. $\mathit{support}(ab) = 5$, since all but the fourth sequence in $L$ contain $ab$ at some position. However, $\mathit{support}(\wedge \star[1, +\infty]\, ab) = 1$, since only the third sequence has at least one element before $ab$.
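When every page is a single character, as in Example 3, the support of a g-sequence can be cross-checked by translating it into a regular expression. This is a swapped-in technique for illustration, not the miner described in this paper. The list encoding of g-sequences, the `"^"` stand-in for the root and the use of `None` for an unbounded wildcard are assumptions of the sketch.

```python
import re

ROOT = "^"   # ASCII stand-in for the root symbol

def to_regex(g):
    """Translate a g-sequence over single-character pages into a regex."""
    parts = []
    for item in g:
        if item == ROOT:
            continue                        # anchoring is handled in support()
        if isinstance(item, tuple):         # wildcard (low, high); None = unbounded
            low, high = item
            upper = "" if high is None else str(high)
            parts.append(".{" + str(low) + "," + upper + "}")
        else:
            parts.append(re.escape(item))
    return re.compile("".join(parts))

def support(g, log):
    """Number of log sequences matched by g (Def. 7)."""
    pattern = to_regex(g)
    # A leading root anchors the match at the start; otherwise the first
    # element may appear anywhere in the sequence.
    find = pattern.match if g[0] == ROOT else pattern.search
    return sum(1 for s in log if find(s))

log = ["abcd", "abdc", "klmnab", "bcpdqr", "abdec", "abcd"]
print(support(["a", "b", "c"], log))                  # abc
print(support(["a", "b", (0, None), "c"], log))       # ab * c
print(support([ROOT, "c", "d"], log))                 # root followed by cd
print(support([ROOT, (1, None), "a", "b"], log))      # at least one page before ab
```

The same function also lets one check that a sequence without root and the same sequence preceded by the root and an unrestricted wildcard receive equal support.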

From the above example we see that the following lemma holds:

Lemma 4: For a sequence $s \in U^*$ and a log file $L$ it holds that $\mathit{support}(s) = \mathit{support}(\wedge \star[0, +\infty]\, s)$.

Using the notion of support, we can define two derived measures for g-sequences, in a similar way as they are defined for association rules.

Definition 8: Let $L$ be the log file over sequences from $U^*$, let $s \in U^*$ be a sequence and let $g$ be a g-sequence. Then, the "confidence of $s$ after" or "following" $g$, or equivalently the "confidence of $g \cdot s$", is the probability of $s$ appearing after $g$:

$$\mathit{confidence}(s, g) = \frac{\mathit{support}(g \cdot s)}{\mathit{support}(g)}$$

We consider one more quality measure, the "improvement", again adjusted from the respective measure for association rules [BL97].

Definition 9: Let $g$ be a g-sequence, $s$ be a sequence and $L$ be the log file. The "improvement of $s$ after/following $g$", $\mathit{improvement}(s, g)$, is the ratio:

$$\mathit{improvement}(s, g) = \frac{\mathit{support}(g \cdot s) / \mathit{support}(g)}{\mathit{support}(s) / |L|} = \frac{\mathit{support}(g \cdot s) \cdot |L|}{\mathit{support}(g) \cdot \mathit{support}(s)}$$

This measure takes the overall importance of $s$ within $L$ into account. As pointed out in [BL97], the improvement of a rule $A \rightarrow B$ indicates whether it is better to predict $B$ given $A$ (improvement > 1) rather than to assume that $B$ will occur anyway (improvement < 1). In our context, this measure specifies whether it is better to predict the appearance of $s$ after $g$ rather than to assume that $s$ will appear independently of $g$. Note that a closely related measure is used in the area of classification models under the name "lift" [BL97].
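As a worked illustration of Defs. 8 and 9, the sketch below plugs in support values from the log of Example 3, with $g = ab$ and $s = c$. The value support(c) = 5 is not stated in the example; it is counted here from the six sequences (every sequence except klmnab contains c). The function names are chosen for this sketch.

```python
from fractions import Fraction

def confidence(support_gs, support_g):
    """Def. 8: support(g.s) / support(g)."""
    return Fraction(support_gs, support_g)

def improvement(support_gs, support_g, support_s, log_size):
    """Def. 9: confidence of s after g, divided by the relative support of s."""
    return Fraction(support_gs * log_size, support_g * support_s)

# Supports taken from the log of Example 3 (|L| = 6), with g = ab and s = c.
support_abc = 2   # support(g.s), stated in Example 3
support_ab = 5    # support(g), stated in Example 3
support_c = 5     # support(s), counted from the log, not stated in the example
log_size = 6

conf = confidence(support_abc, support_ab)
impr = improvement(support_abc, support_ab, support_c, log_size)
print(conf, impr, impr < 1)
```

Here the improvement is below 1, so seeing `ab` makes an immediately following `c` less likely than `c` is in the log overall.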

2.3 Templates for Generalized Sequences

We have introduced g-sequences as extensions of sequences. A g-sequence refers to explicitly specified elements from the set $U$. To discover all g-sequences in a log file, we need the notion of template for g-sequences.

Definition 10: Let $\Sigma$ be a set of symbols, such that the special symbols root $\wedge$ and wildcard $\star$ do not belong to $\Sigma$, and let $\Sigma^+ = \Sigma \cup \{\wedge, \star\}$. The function $f : \Sigma^+ \rightarrow U^+$ is a "binding from $\Sigma^+$ to $U$" iff $f(\star) = \star$, $f(\wedge) = \wedge$ and for each $x \in \Sigma$: $f(x) \in U$.

Obviously, $\Sigma^+$ is a set of variables. The fact that $\Sigma^+$ contains the wildcard means that we allow anonymous variables. A binding is then a mapping of variables to elements of $U$, which assigns the anonymous wildcard variable to itself.

Definition 11: A "template" is a vector $t \in (\Sigma^+)^*$ for which the following conditions hold: (i) The elements of $t$ that belong to $\Sigma$ are distinct from each other. (ii) There exists a binding $f : \Sigma^+ \rightarrow U^+$ such that $f(t[1]) \cdot f(t[2]) \cdot \ldots \cdot f(t[\mathit{length}(t)])$ is a g-sequence.

A template is thus a vector of variables with distinct names, which can be mapped to a g-sequence by some binding function. Similarly to Lemma 2, we can draw some conclusions on the structure of a template.

Lemma 5: Let $t \in (\Sigma^+)^*$ be a template. Then $t = t_1 \star t_2 \star \ldots t_n \star$, where $t_1 \in \Sigma \cup \{\wedge\}$ and $t_2, \ldots, t_n \in \Sigma$.

This lemma says that a template is a vector of segments $t_i \star$ for $i = 1, \ldots, n$. Moreover, the first element of a template cannot be a wildcard. This follows from Def. 4 of g-sequences, where it is stated that the first element of a g-sequence is either an element of $U$ or the root $\wedge$. We call the elements of $t$ that belong to $\Sigma$ "named variables".

Definition 12: Let $t \in (\Sigma^+)^*$ be a template of length $n$. The "set of solutions for $t$", $S(t)$, is the set:

$$S(t) = \{\, g \in (U^+)^* \mid \exists f : \Sigma^+ \rightarrow U^+,\ g = f(t[1]) \cdot f(t[2]) \cdot \ldots \cdot f(t[n]) \,\}$$

A solution to a template is a g-sequence "matching" the template.

Hence, a member of the set $S(t)$ is a g-sequence, the non-wildcard elements of which can be bound to the named variables of $t$. Recalling Def. 6 on the sequences matching a g-sequence:

Definition 13: Let $t$ be a template of length $n$. A sequence $s \in U^*$ "satisfies" $t$ iff there is a function $f$ that binds the named variables of $t$ to elements of $s$ in such a way that $f(t[1]) \cdot f(t[2]) \cdot \ldots \cdot f(t[n])$ is a g-sequence matched by $s$.

In general, there can be more than one such function, since a wildcard in $t$ can be bound on an element of $s$ or on a wildcard, thus ignoring some elements of $s$.

Example 4: Let $t = A \star[1, 2]\, B$ be a template, let $U = \{a, b, c, d\}$ and let $s = abcd$ be a sequence. This sequence satisfies $t$, because the function $f$ with $f(A) = a$, $f(\star[1, 2]) = \star[1, 2]$ and $f(B) = d$ produces the g-sequence $a \star[1, 2]\, d$, which is matched by $s$. The sequence $s' = abc$ also satisfies $t$, using the binding function $f'$ with $f'(A) = a$, $f'(\star[1, 2]) = \star[1, 2]$ and $f'(B) = c$. Finally, the sequence $s'' = cd$ does not satisfy $t$: Without listing all possible binding functions, we can see that all of them should bind $\star[1, 2]$ to itself and $A, B$ to some element in $U$. This implies a sequence of at least 3 elements, while $s''$ has length 2.

In practice, we are not interested in all solutions to a template $t$, but only in those fulfilling certain interestingness criteria. Such criteria may refer to threshold values for the support, confidence and improvement of each g-sequence/solution, as described in subsection 2.2.2. Additional criteria may concern the properties of the elements of $U$ that appear in the solution. In [SF98], we have described a language for specifying such criteria. The specification of interestingness criteria reduces the set of solutions to the "set of interesting solutions" $I(t) \subseteq S(t)$.
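To make Def. 13 concrete, the bindings under which a sequence satisfies a template can be enumerated by choosing one position per named variable and checking the wildcard ranges between consecutive positions. The sketch below is an illustration under assumptions, not the paper's algorithm: it covers only templates that start with a named variable (no root), a wildcard range is a `(low, high)` tuple with `None` for no upper bound, and the function name `bindings` is chosen here.

```python
def bindings(names, gaps, s):
    """All bindings of the named variables under which s satisfies the template.

    names -- named variables in template order, e.g. ["A", "B"]
    gaps  -- gaps[k] = (low, high): wildcard range between names[k] and names[k+1]
    """
    results = []

    def extend(k, pos, partial):
        if k == len(names):
            results.append(dict(partial))
            return
        if k == 0:
            candidates = range(len(s))      # first variable may bind anywhere
        else:
            low, high = gaps[k - 1]
            start = pos + 1 + low
            stop = len(s) if high is None else min(len(s), pos + high + 2)
            candidates = range(start, stop)
        for p in candidates:
            extend(k + 1, p, partial + [(names[k], s[p])])

    extend(0, -1, [])
    return results

# Template A *[1,2] B from Example 4.
print(bindings(["A", "B"], [(1, 2)], "abcd"))
print(bindings(["A", "B"], [(1, 2)], "abc"))
print(bindings(["A", "B"], [(1, 2)], "cd"))
```

For `abcd` the enumeration returns three bindings, among them the one used in Example 4 (A bound to a, B bound to d), which illustrates why more than one binding function can exist for the same sequence.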

2.4 The Solutions to a Template

We have thus far introduced the notion of generalized sequence and specified when a sequence in the log file $L$ matches a g-sequence (Def. 6). Finding all sequences in $L$ that match a given g-sequence is a well-known string matching problem. However, in web log mining and in g-sequence discovery in general, the g-sequence is not known in advance. We must rather find all g-sequences that match a template given by the user or generated by a template generator and that satisfy some interestingness criteria. For those g-sequences, we then find the sequences matching them.

Our g-sequence discovery miner ("GSM") extracts all g-sequences satisfying a set of user-specified criteria. Those criteria may concern statistical thresholds, e.g. minimum support, confidence or improvement, according to 2.2.2. Restrictions on the meta-data of the web pages are also supported. The structure of the g-sequences in terms of number of named variables and possible restrictions on the wildcards is specified in the template, as described in 2.3.

Example 5: We want to discover all solutions to a template $A \star B \star C$, so that the support is at least 100 and the confidence at least 90%. We formulate our mining query in the MINT language presented in [SF98].

    NODE AS A B C, TEMPLATE A*B*C AS t
    WHERE C.support >= 100
    AND C.support / B.support >= 0.9

Further criteria can be specified, e.g. on the url contents:

    ... AND B.url CONTAINS "Environment"
    AND B.url NOT CONTAINS "frame"

GSM processes an off-line aggregated version of the log file, extracts the g-sequences conforming to the criteria of the human mining expert and displays them in graph form. In section 3, we introduce the semantics of the Aggregated Log file input to GSM. Section 4 contains the analysis of the GSM itself. It is obviously of interest whether already established mechanisms for the discovery of association rules and of conventional sequences can be used for this purpose. In section 5, we discuss the shortcomings of conventional miners, but show how they can serve as preprocessors to the actual mining phase. This section can be read before or after sections 3 and 4.

3 Aggregating the Sequences of a Log File

A log file may contain duplicates. Moreover, some sequences may have common prefixes. If we merge all common prefixes together, we transform the log into a tree, each node of which is annotated with the number of sequences having the same prefix up to and including this node. This tree contains the same information as the initial log. Hence, when we look for sequences matching a given g-sequence or satisfying a template, we can scan the tree instead of the original multiset. On the tree, a prefix shared among $k$ sequences appears and gets tested only once.

3.1 The Notion of Aggregate Tree

An "aggregate tree" is the tree representation of a multiset of sequences, the common prefixes of which are merged. More formally:

Definition 14: Let $U$ be the set of elements and $X$ be a multiset of sequences from $U^*$. The "aggregate tree" $T = \mathit{agTree}(X)$ of $X$ is a tree of labeled nodes from a set $V$ and edges from a set $E \subseteq V \times V$. The labels are pairs from the set $(U \cup \{\wedge\}) \times \mathbb{N}$, the components of which are named "element" and "support" respectively. The following conditions hold:
1. The label of the root $\mathit{root}(T)$ of $T$ is $(\wedge, |X|)$.
2. A tree branch is a sequence of tree nodes. The root is the parent of all tree branches but does not belong to any branch.
3. A vector $\mathit{tb} \in V^*$ belongs to $\mathit{agTree}(X)$ iff:
   (a) There is an edge $(\mathit{root}(T), \mathit{tb}[1]) \in E$.
   (b) There is a sequence $s \in X$ such that $\mathit{length}(\mathit{tb}) = \mathit{length}(s)$ and $\mathit{tb}[i].\mathit{element} = s[i]$ for each $i = 1, \ldots, \mathit{length}(\mathit{tb})$.
   (c) For each $i = 1, \ldots, \mathit{length}(\mathit{tb})$, $\mathit{tb}[i].\mathit{support}$ is equal to the number of sequences in $X$ that have the form $s[1] \cdot s[2] \cdot \ldots \cdot s[i] \cdot y$ for some, possibly empty, subsequence $y$.
4. For each sequence $s \in X$ there is a tree branch $\mathit{tb}$ such that the "element" parts of the labels of $\mathit{tb}$ constitute a sequence $s'$ having $s$ as prefix.
The $\mathit{agTree}(X)$ is directed from the root towards the leaves.

In this definition, the sequences in $X$ have been mapped into tree branches. Condition 2 states that a tree branch is a sequence starting at, but not including, the dummy root and ending at a leaf node. Condition 3 describes how such a tree branch $\mathit{tb}$ is built, namely from a sequence $s$ in $X$. A node $\mathit{tb}[i]$ in $\mathit{tb}$ is annotated with the number of sequences starting with $\mathit{prefix}(s, i)$: The number of those sequences is the support of $\mathit{prefix}(s, i)$ in $X$ according to Def. 7. Condition 4 states that all sequences in $X$ are mapped into tree branches in that way.

Recorded sequences (number of accesses per sequence):
  s1. a-b-e (8)
  s2. b-d-b-c (2)
  s3. b-c-e (7)
  s4. a-b-e-f (3)
  s5. a-d-b (10)
  s6. b-d-b-e (4)
  s7. b-e-f (1)

The aggregate tree of the example log, with nodes labeled (Element, Occurrence), Support:
  (^,1),35   [dummy node]
    (a,1),21
      (b,1),11
        (e,1),11
          (f,1),3
      (d,1),10
        (b,1),10
    (b,1),14
      (d,1),6
        (b,2),6
          (c,1),2
          (e,1),4
      (c,1),7
        (e,1),7
      (e,1),1
        (f,1),1

Figure 1: Constructing aggregate trees

Example 6: Fig. 1 lists a multiset $X$ of sequences. To avoid repeating each sequence, we list each one once, annotated with the number of times it is recorded in the multiset. The figure also shows the $agTree(X)$ accommodating those sequences. b is the first event in the sequences s2, s3, s6 and s7. (b,1) denotes the first occurrence of b; (b,2) denotes a reoccurrence in s2 and s6. We retain page occurrences explicitly for reasons of efficiency. By adding up the appearances of the sequences s2, s3, s6 and s7, we compute 14 as the support of (b,1) as first sequence element. Similarly, the support of (a,1) as first sequence element is 21. In s1 and s4, event b has occurred after a; the respective aggregate tree node has a support of 11.

In our example, we have replaced the set of elements $U$ with the set $U' = U \times \mathbb{N}$, in which each element of $U$ is extended by a positive integer. This integer denotes the occurrence of the element of $U$ within a sequence, and is intended to flatten cycles. We use the two sets $U$, $U'$ interchangeably, unless we make an explicit remark about reoccurring elements within a sequence. △

3.2 The Aggregated Log

The aggregate tree of the whole log file $L$ is called the "Aggregated Log" and forms the basis of our computations for g-sequence discovery. The log file itself is not used in the mining process. The reason is that the Aggregated Log has fewer nodes, and their labels carry useful, already aggregated information. The higher the number of common prefixes in the log, the lower the number of nodes in the Aggregated Log. In web log files, common prefixes are very frequent, because a site has a fixed number of entry pages, which are much more likely to be accessed first than the others. Hence, using the Aggregated Log during mining means that potentially far fewer nodes are accessed than in the original log file. In [SF98], we describe how the Aggregated Log is built from a log file of web access entries. In short, the Aggregated Log is constructed by one scan over the log. After this initial scan, it grows by incrementally merging new sequences recorded in the log into the tree branches.
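The construction can be illustrated with a minimal Python sketch. It is not the implementation described in [SF98]; the names `Node`, `insert`, `size` and `leaves` are hypothetical, and the log is the multiset of Fig. 1 written as (sequence, count) pairs. Each call to `insert` merges one recorded sequence into the tree, so the same function covers both the initial scan and later incremental growth.

```python
from collections import defaultdict


class Node:
    """Aggregate-tree node labelled (element, occurrence) with a support."""

    def __init__(self, element, occurrence):
        self.element = element
        self.occurrence = occurrence
        self.support = 0
        self.children = {}


def insert(root, sequence, count=1):
    """Merge one recorded sequence into the tree, raising prefix supports."""
    root.support += count
    node, seen = root, defaultdict(int)
    for element in sequence:
        seen[element] += 1  # occurrence number, used to flatten cycles
        key = (element, seen[element])
        if key not in node.children:
            node.children[key] = Node(*key)
        node = node.children[key]
        node.support += count


def size(node):
    """Number of nodes strictly below `node`."""
    return sum(1 + size(child) for child in node.children.values())


def leaves(node):
    """Number of leaf nodes in the subtree rooted at `node`."""
    if not node.children:
        return 1
    return sum(leaves(child) for child in node.children.values())


# The multiset of Fig. 1 as (sequence, number of accesses) pairs.
LOG = [("abe", 8), ("bdbc", 2), ("bce", 7), ("abef", 3),
       ("adb", 10), ("bdbe", 4), ("bef", 1)]

root = Node("^", 1)  # dummy root
for sequence, count in LOG:
    insert(root, sequence, count)

recorded_elements = sum(len(s) * c for s, c in LOG)
print(root.support, size(root), leaves(root), recorded_elements)
```

For this log the tree holds 15 nodes below the dummy root, against 114 recorded sequence elements, which illustrates the reduction that common prefixes make possible.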

4 An Algorithm for g-Sequence Discovery

Our g-sequence discovery miner GSM operates on the Aggregated Log tree described in the previous section. Its goal is the discovery of g-sequences conforming to a template of variables (see subsection 2.3) and satisfying a number of user-defined constraints such as statistical thresholds.


4.1 The GSM Algorithm

The input to GSM is a template $t$ and a possibly empty list of predicates restricting the statistics and contents of the named variables in $t$. GSM traverses the Aggregated Log once for each named variable in $t$ and gradually constructs a tree $T$. The branches of $T$ constitute the set of solutions, i.e. the set of g-sequences satisfying the template and fulfilling the criteria specified by the user in the MINT query. Since both the Aggregated Log and $T$ are trees, we denote the nodes of the latter as "t-nodes" to avoid ambiguities.

Definition 15: Let $X$ be a multiset and $P(X)$ the set of multisets over $X$. A "t-node" is a tree node labelled with pairs from the set $P(X) \times \mathbb{N}$. The first label is named "content" and the second label is named "support".

Input: A template $t = t_1 * t_2 * \ldots t_n *$ and a possibly empty set of constraints on $t_1, \ldots, t_n$.
Output: A tree $T$ of solutions to $t$ that satisfy the constraints. Branches of $T$ with length $i < n$ constitute solutions to the first $i$ named variables of $t$.
Procedure: GSM(t, constraints)

For $i = 1$:
  1. Scan the Aggregated Log to find all nodes that "satisfy" $t_1$ according to Def. 13.
  2. Create one t-node ⟨e⟩ for all found nodes that refer to the same element $e \in U$. We set ⟨e⟩.content equal to the multiset of those nodes. We set ⟨e⟩.support equal to $\sum_{v \in ⟨e⟩.content} v.support$ (recall Def. 14 for the names of the labels of aggregate tree nodes).
  3. If the support of the new t-node does not satisfy the statistical thresholds specified for it, then discard it.
  4. Place all remaining t-nodes below a dummy root ^, whose contents are not of interest. Those t-nodes form $level_1$ of the output tree $T$.

For $i = 2, \ldots, n$:
  For each t-node ⟨e⟩ at $level_{i-1}$ of $T$:
    For each node $x$ of the Aggregated Log that belongs to ⟨e⟩.content:
      1. Scan the subtree of the Aggregated Log rooted at $x$ to find all path prefixes below $x$ that satisfy $* t_i$. The end of each prefix is an element $e'$ matching $t_i$.
      2. Create one t-node ⟨e'⟩ for all nodes that match $t_i$ and refer to the same element $e' \in U$.
      3. If the t-node does not satisfy the statistical threshold that applies to it, discard it.
      4. Place all remaining t-nodes below ⟨e⟩.
  The t-nodes thus created form $level_i$ of $T$.

Traverse $T$:
  For each branch $b$ of $T$: display_graph(b)

Figure 2: The GSM Algorithm
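The procedure of Fig. 2 can be sketched in Python under simplifying assumptions that are not part of the original algorithm description: every wildcard is unbounded, the only constraints are minimum support thresholds (one per variable), and the tree $T$ is explored depth-first by recursion rather than level by level, which yields the same branches. All names (`Node`, `build_aggregate_tree`, `descendants`, `gsm`) are hypothetical, and the graph display step is omitted.

```python
from collections import defaultdict


class Node:
    """Aggregate-tree node: label = (element, occurrence), plus a support."""

    def __init__(self, label):
        self.label = label
        self.support = 0
        self.children = {}


def build_aggregate_tree(log):
    """Merge common prefixes of (sequence, count) pairs into one tree."""
    root = Node(("^", 1))
    for sequence, count in log:
        root.support += count
        node, seen = root, defaultdict(int)
        for element in sequence:
            seen[element] += 1
            label = (element, seen[element])
            if label not in node.children:
                node.children[label] = Node(label)
            node = node.children[label]
            node.support += count
    return root


def descendants(node):
    """Yield every node strictly below `node` (unbounded wildcard)."""
    for child in node.children.values():
        yield child
        yield from descendants(child)


def gsm(root, min_supports):
    """Return the full-length branches of T as lists of (label, support)."""
    solutions = []

    def expand(level, anchors, branch):
        if level == len(min_supports):
            solutions.append(branch)
            return
        # Group candidate bindings by label: one t-node per element.
        t_nodes = defaultdict(list)
        for x in anchors:
            for y in descendants(x):
                t_nodes[y.label].append(y)
        for label, content in t_nodes.items():
            support = sum(v.support for v in content)
            if support >= min_supports[level]:  # statistical threshold
                expand(level + 1, content, branch + [(label, support)])

    expand(0, [root], [])
    return solutions


# Multiset of Fig. 1 and the thresholds of Example 7 (X, Y, Z).
FIG1_LOG = [("abe", 8), ("bdbc", 2), ("bce", 7), ("abef", 3),
            ("adb", 10), ("bdbe", 4), ("bef", 1)]

solutions = gsm(build_aggregate_tree(FIG1_LOG), [20, 10, 4])
for branch in solutions:
    print(" * ".join(f"({el},{occ}):{sup}" for (el, occ), sup in branch))
```

On the log of Fig. 1 with thresholds 20, 10 and 4, this sketch returns three full-length branches: (a,1) * (b,1) * (e,1), (a,1) * (d,1) * (b,1) and (b,1) * (e,1) * (f,1), the last with supports 35, 23 and 4.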

The GSM algorithm constructs $T$ from the Aggregated Log as shown in Fig. 2. GSM builds the tree of solutions $T$ in a breadth-first way. It first traverses the Aggregated Log to find nodes satisfying the first template variable. Nodes referring to the same element of set $U$ correspond to the same variable binding and are therefore put together, constituting one t-node of $T$. All t-nodes containing bindings of the first variable form $level_1$. In each subsequent step, we do not need to scan the whole Aggregated Log. A binding of the $i$th variable is possible only in the subtrees below the nodes to which the $(i-1)$th variable has been bound. A given binding of this variable is represented by a t-node at $level_{i-1}$. Hence, GSM only needs to search the subtrees of the Aggregated Log nodes contained in this t-node. The last step of GSM is the presentation of all solutions in graph form. Before describing this algorithm, we explain the behaviour of GSM by means of an example.

Example 7: Using the aggregate tree of Fig. 1, we apply GSM as described in Fig. 2 to find all g-sequences of the form $X * Y * Z$ for which the following conditions hold (in MINT syntax):

X.support >= 20 AND Y.support >= 10 AND Z.support >= 4

In order to draw a tree $T$ of t-nodes that is not trivial and can fit in a figure, we (a) restrict the support of each variable rather than the confidence among the variables and (b) build at each level only the t-nodes satisfying the conditions and at most one violating them. The tree of solutions $T$ is depicted in Fig. 3 and explained below.

[Figure 3: tree of t-nodes below the dummy root ^, drawn in three levels (level1, level2, level3); each t-node lists its Aggregated Log nodes in the form (element, occurrence), support.]

Figure 3: A tree of g-sequences/solutions

In the first step, we build all t-nodes of $level_1$. For this, we traverse the Aggregated Log of Fig. 1 and build one t-node comprised of all nodes referring to the same element. From Fig. 1, we can see that there exist 7 distinct elements, resulting in the t-nodes ⟨a,1⟩, ⟨b,1⟩, ⟨e,1⟩, ⟨f,1⟩, ⟨d,1⟩, ⟨c,1⟩, ⟨b,2⟩. Each of them is built, tested against the predicate for variable X and discarded immediately if the predicate is not satisfied. So, only three t-nodes remain, those shown at $level_1$ of Fig. 3. At this point we can see the advantage of using the Aggregated Log instead of the original log file: the t-node ⟨a,1⟩ contains a single node of the Aggregated Log; to build it from the log file, we would have to merge 21 sequence elements. Similarly, ⟨b,1⟩ is comprised of three nodes, which aggregate 35 sequence elements in the log file.

For $level_2$ we traverse each subtree below a node belonging to a t-node of $level_1$. We can see that all other ancestors of each node appearing in a t-node of $level_2$ can be arbitrary in number and content. For ⟨a,1⟩ GSM builds four t-nodes, of which the one containing the appearances of (f,1) after (a,1) is rejected because its support is 3 and thus less than 10. For ⟨b,1⟩, four t-nodes are built and three of them are discarded. In Fig. 3 we show the one retained and one of those being discarded. For ⟨e,1⟩, GSM builds only one t-node, since the only element appearing after (e,1) is (f,1). We can see from the figure that its support is lower than the threshold for variable Y, so this t-node is discarded.

For $level_3$ we proceed in a similar way. Here we show all t-nodes being built, including those rejected. We can see that there are t-nodes with the same content that are built more than once. This issue is further elaborated in the analysis below. △

The tree of t-nodes $T$ has as branches the solutions to the input template. Each solution is a g-sequence. Its graph representation is built by the algorithm display_graph() shown in Fig. 4.

Input: A branch $tb$ of $T$. Its length is $n$. For $i = 1, \ldots, n$, $tb[i]$ refers to the element $e_i \in U$.
Output: A directed acyclic graph $dag(tb)$.
Procedure:

For $i = 1, \ldots, n-1$:
  1. Let $T_i$ be an empty aggregate tree.
  2. For each Aggregated Log node $x$ in the t-node $tb[i]$: Insert into $T_i$ each subbranch $y$ of the Aggregated Log that starts at $x$ and ends at a node in $tb[i+1]$.
  3. For each Aggregated Log node $x$ in the t-node $tb[i]$: Insert into $T_i$ each subbranch $z$ of the Aggregated Log that starts at $x$, has a common prefix with a branch in $T_i$ but does not belong to $T_i$ (i.e. does not end at a node in $tb[i+1]$). After this step, the root of $T_i$ is a node $(e_i, tb[i].support)$.
  4. Replace the leaf nodes of $T_i$ with a pointer to the root of $T_{i+1}$, which is to be built next.

The last aggregate tree $T_n$ is a single-node tree with $root(T_n) = (e_n, tb[n].support)$.

Figure 4: The algorithm display_graph()

Example 8: Continuing Example 7, we build the graph for one solution from $T$, namely the branch $tb$ with $tb[1]$ referring to (b,1), $tb[2]$ referring to (e,1) and $tb[3]$ referring to (f,1). In Fig. 3, this branch is the one in the middle of $T$. The support values of its t-nodes can be computed by adding the supports of the individual nodes in each t-node: $tb[1].support = 35$, $tb[2].support = 23$ and $tb[3].support = 4$. At the left side of Fig. 5 we show the branches contributing to each of $T_1$ and $T_2$. If two branches have a common prefix, this prefix is not considered twice. To demonstrate that, we show subtrees instead of separate branches. The graph produced by merging the three aggregate trees is shown in the lower part of Fig. 5. △

A small enhancement of GSM. In Example 7, after building all t-nodes of $level_2$ and removing those not satisfying the conditions on the second variable, we are left with one t-node of $level_1$ without children. This implies that this t-node does not contribute to any solution for the second and third variable of the template. This observation can be generalized for templates with $n$ variables. In particular, if a t-node at $level_i$ has no children, it can be discarded from memory when the construction of subsequent levels starts. Its removal makes its parent a candidate for removal, too, in a recursive fashion. This activity can be incorporated as the last step of the loop on counter $i$ in Fig. 2.

[Figure 5: the aggregate trees T1, T2 and T3 built for the branch of Example 8, with roots (b,1),35, (e,1),23 and (f,1),4, and the graph obtained by merging them.]

Figure 5: The graph representation of a g-sequence

In particular, we start at a t-node ptr = ⟨e⟩ of $level_i$ that has no children and we perform the following loop, moving towards the root t-node:

While ptr has no children do
  1. Go to the parent of ptr, say parent, at the previous level.
  2. Remove the t-node pointed to by ptr.
  3. Set ptr = parent.

This loop improves space utilization by removing useless t-nodes at a small computational overhead. If the expert is also interested in solutions for the first $i$ variables of $t$, solutions subject to removal must be copied to a temporary cache for later presentation before being discarded from $T$.

GSM versus Apriori. The reader may have noticed that there are parallels between GSM and the classic Apriori algorithm for the discovery of association rules [AS94]. The levels built by GSM are the counterpart of the frequent itemsets extracted by Apriori, where $level_i$ corresponds to rules of length $i$, i.e. containing $i$ items. However, there are two differences in the nature of the problems addressed by the two methods.

First, GSM must maintain order, since sequences are order-preserving, while association rules are order-insensitive. We could see the impact of this fact in Example 7: the input template has a solution (a,1) * (b,1) * (e,1). Knowing this, we cannot make any assessments about solutions that start with (b,1) or (e,1) and contain (a,1) in a position other than the first. Thus, the search space of GSM is much larger than the search space of a miner discovering association rules.

Second, GSM must generate g-sequences, i.e. cope with wildcards. For the order-insensitive rules, this is a trivial requirement. For sequences it is not. In particular, we can see the impact of wildcards in Example 7: some t-nodes are considered more than once in the same level or at different levels. This has no effect on the correctness of the results: each t-node refers to the Aggregated Log nodes it should and only to them. However, considering the same branch or subtree multiple times affects performance. To tackle this problem, we have designed an optimized version of GSM.


4.2 Optimizing GSM

The optimized version of GSM, GSM-E, reduces the number of subtrees considered more than once, as follows:

1. We augment the nodes of the Aggregated Log with two properties, reused and available. The property reused is a counter with initial value zero, while available is a boolean flag initialized to FALSE.
2. At step $i = 2, \ldots, n$, we scan the subtree below each node $x$ belonging to a t-node at $level_{i-1}$. We look for bindings that satisfy $* t_i$. Each such binding ends at an Aggregated Log node $y$ that matches $t_i$. Then:
   (i) We add $y$ to a (new or already existing) t-node of $level_i$, as before.
   (ii) We create an edge from $x$ to $y$. We call this edge a v_edge(x, y).
   (iii) We increase the value of y.reused by 1.
   We process all nodes in each t-node of $level_{i-1}$ in that way.
3. We then build the t-nodes of $level_i$, by processing each t-node $\tau$ at $level_{i-1}$ as follows. For each node $x$ belonging to $\tau$, we first process each Aggregated Log node $y$ in the subtree rooted at $x$:
   (a) If $y$ is appropriate as a binding for $t_i$, we add $y$ to a sorted list list(x) of nodes.
   (b) If the value of y.available is FALSE, then we build a sorted list(y) of nodes appropriate as bindings for $t_i$. Otherwise, list(y) is already available.
   (c) We merge list(y) with list(x).
   (d) We decrease the value of y.reused by 1.
   (e) If y.reused is still above a threshold MIN_REUSE, we set y.available to TRUE. Otherwise, we discard list(y).
   We finally sort-merge list(x) to build groups of nodes identical to each other and place them in t-nodes below $\tau$.

GSM-E attempts a better utilization of resources, by trading space for time. Lists of bindings are marked as reusable, so that they can be cached if they are reused quite often and if memory space is available. If enough space is available to keep each reusable list in cache as long as needed, GSM-E scans each subtree of the Aggregated Log only once.

GSM-E is currently only applicable for templates with $*[0, +\infty]$: a node $z$ in a list(y) as described above is only reusable if both the branch $y w_1 z$ and the branch $x w_2 y w_1 z$ satisfy $* t_i$. For a wildcard with an upper boundary, $z$ may qualify as a binding with respect to $y$ but not with respect to $x$. For a wildcard with a lower boundary, $z$ may qualify with respect to $x$ but not with respect to $y$. Optimizations and tradeoffs must be considered for the general case.

4.3 Analysis

We now compare the processing cost of discovering g-sequences with GSM over the Aggregated Log to the cost of processing the original log file directly in a sequential way. Let $t = t_1 * t_2 * \ldots t_n *$ be the input template. Let $N = |L|$ be the number of sequences in $L$ and let $M$ be the length of the longest sequence, i.e. $M = \max_{x \in L} \{length(x)\}$.

To estimate the number of g-sequences that are solutions for $t$, we observe that each sequence $x$ contributes with at most $\binom{length(x)}{n}$ potential matches for $t_1, \ldots, t_n$ and hence with as many solutions to $t$. By using the maximum sequence length $M$ as an upper boundary, we compute an upper limit as the number of combinations "M choose n", $\binom{M}{n}$. For the total number of g-sequences, we multiply this number, the binomial coefficient, with the total number of sequences in the log, getting $N \cdot \binom{M}{n}$. For each of those g-sequences, the sequences matching it must be found and put together to form the output graph. It should be noted that even sequences of length less than $n$ contribute to each solution, by increasing the support of common prefixes. The construction of all solutions can thus be accomplished in one pass over the log file. Thus, the upper boundary to the sequential discovery of all solutions in $L$ is:

$$ptime(L, n) = \binom{M}{n} \cdot N^2 \qquad (1)$$

This sequential mechanism is amenable to optimizations. We can for instance keep track of identical sequences and thus avoid generating the same solution more than once. Keeping track of identical prefixes implies a larger overhead in tests and temporary storage. Finally, if the prefix of a solution violates a statistical threshold, we can automatically reject all solutions with the same prefix. However, computing the effect of this enhancement is not necessary, since the purpose of our computation is a comparison with GSM, where this enhancement is already incorporated: it corresponds to the removal of t-nodes without children.

GSM is applied on the Aggregated Log instead of the original log file. Since sequences with common prefixes are merged, the Aggregated Log has fewer nodes, even if its number of leaf nodes is still $N$. To compute the number of leaf nodes, we define the set $Lset = \{x_1, \ldots, x_N\}$ consisting of the sequences in $L$ enumerated in an arbitrary way to mask the duplicates. We now need a measure of the decrease in the number of nodes between the original log file and the Aggregated Log.
For this, we assume that sequences are inserted in the Aggregated Log in their order of enumeration, so that a sequence $x_i \in Lset$ appears in the Aggregated Log only if there is no sequence with a lower index number and the same content as $x_i$. More generally, any prefix of $x_i$ appears in the Aggregated Log only if there is no sequence with a lower index number and the same prefix. Note that the ordering of sequences itself is not significant: a common prefix will appear in the Aggregated Log once and only once, independently of the enumeration we use. For each $x_i$ and for each $j = 1, \ldots, length(x_i)$, we define:

$$k(i, j) = \begin{cases} 0, & \exists x_{i'} \in Lset : i' < i \wedge prefix(x_{i'}, j) = prefix(x_i, j) \\ 1, & \text{otherwise} \end{cases}$$

We now specify a function that returns the minimum position after which a sequence $x_i$ contributes elements to the Aggregated Log:

$$k(i) = \min\{ j \mid j = 1, \ldots, length(x_i) \wedge k(i, j) = 1 \}$$

The function $k(i)$ returns the first position of the largest suffix of $x_i$ that is not common to any sequence with index smaller than $i$. For our computations, we need rather a function returning the "degree of non-overlap" between $x_i$ and its predecessors:

$$K(i) = \min\{k(i) - 1, M\} \qquad (2)$$

where $M$ is the maximum length of a sequence in $L$. If $x_i$ duplicates an earlier sequence, no such position exists and we take $K(i) = M$.

When GSM traverses the Aggregated Log to build the solutions for template $t$, it accesses branches instead of individual sequences. However, we can equivalently say that GSM accesses a sequence $x_i$ as far as it is indeed in the Aggregated Log, i.e. if and only if the degree of non-overlap for $x_i$ is less than $M$. For variable $t_1$, GSM can extract at most $M - K(i)$ bindings from $x_i$. If $K(i) = 0$, then $x_i$ appears in the Aggregated Log as a whole and all its elements are considered as potential matches to $t_1$. At the other extreme, if $K(i) = M$, $x_i$ is identical to a sequence already considered: in that case, its elements do not need to be considered.

For variable $t_m$ with $m > 1$, GSM can extract at most $M - (m-1) - K(i)$ bindings from $x_i$. This is an upper limit. After binding variable $t_{m-1}$ to position $x_i[j]$, GSM starts looking for a binding for $t_m$ at position $\max\{m, K(i), j+1\}$. This is possible on a tree, but not in the sequential scan of a file of strings. Thus, the number of bindings extracted by GSM from each sequence $x_i \in Lset$ can be upper bounded by:

$$(M - K(i)) \cdot (M - 1 - K(i)) \cdots (M - (n-1) - K(i)) \cdot \frac{1}{n!} = \binom{M - K(i)}{n}$$

This number is zero if $M - K(i)$ is less than $n$. Now, to calculate the total number of bindings, we sum the binomial coefficients of all $N$ sequences, whereby some sequences are skipped, because their degree of non-overlap is equal to $M$.

To build the graph for each solution, GSM scans part of the Aggregated Log. An upper limit to the part being scanned is the number of Aggregated Log branches, i.e. the number of leaf nodes, $LN$. In fact, GSM scans fewer than $LN$ branches: after binding variables $t_{m-1}$ and $t_m$ to the nodes $x$, $y$ of the same Aggregated Log branch, only the subtree rooted at $x$ is considered and only until node $y$ is found. Other subtrees are ignored. Hence, an upper limit to the processing time of GSM in finding the solutions for template $t$ in the Aggregated Log is computed as:

$$ptime(agLog, n) = \sum_{i=1}^{N} \binom{M - K(i)}{n} \cdot LN \qquad (3)$$

Example 9: We consider the log file and the Aggregated Log of Fig. 1. First, we enumerate the sequences listed in the figure into:

x1, ..., x8: abe      x11, ..., x17: bce     x21, ..., x30: adb
x9, x10: bdbc         x18, ..., x20: abef    x31, ..., x34: bdbe
                                             x35: bef

The cardinality of the log is $N = 35$ and the maximum length is $M = 4$. The Aggregated Log has 6 branches, i.e. $LN = 6$. The degrees of non-overlap for the sequences can be easily calculated into the table below:

K(1) = 0     K(2) = ... = K(8) = 4
K(9) = 0     K(10) = 4
K(11) = 1    K(12) = ... = K(17) = 4
K(18) = 3    K(19) = K(20) = 4
K(21) = 1    K(22) = ... = K(30) = 4
K(31) = 3    K(32) = ... = K(34) = 4
K(35) = 1

For a template $t$ with 3 variables, as in Example 7, the processing time limit for the log file is:

$$pstime(L, 3) = \sum_{i=1}^{35} \binom{4}{3} \cdot 35^2 = 171{,}500$$

By observing that some sequences have length less than 4, we can reduce this number to a better upper limit:

$$pstime'(L, 3) = \left(9 \cdot \binom{4}{3} + 26 \cdot \binom{3}{3}\right) \cdot 35^2 = 75{,}950$$

By recognizing identical sequences, this limit is further reduced to

$$pstime''(L, 3) = 19{,}600$$

For the Aggregated Log, the upper limit is:

$$pstime(agLog, 3) = \sum_{i=1}^{35} \binom{4 - K(i)}{3} \cdot 6^2 = 396$$

△

From Eq. 3 and from the example we can see that GSM performs best for sequences that are small or have a low degree of non-overlap. For long sequences, the binomial coefficient tends to become the dominant factor. If the degree of non-overlap remains low, though, the negative effects of this trend on performance remain limited.
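The figures of Example 9 can be reproduced with a short Python sketch. The function name `degrees_of_non_overlap` is hypothetical, and the sketch assumes that $K(i) = M$ for a sequence that duplicates an earlier one, as the text states for identical sequences. The four limits are computed in the form used in Example 9, i.e. a sum of binomial coefficients multiplied by $35^2$ or $6^2$.

```python
from math import comb

# The log of Fig. 1, enumerated as x1..x35 in the order of Example 9.
LOG = [("abe", 8), ("bdbc", 2), ("bce", 7), ("abef", 3),
       ("adb", 10), ("bdbe", 4), ("bef", 1)]
sequences = [seq for seq, count in LOG for _ in range(count)]

N = len(sequences)                  # cardinality of the log
M = max(len(x) for x in sequences)  # maximum sequence length
n = 3                               # number of template variables


def degrees_of_non_overlap(xs, max_len):
    """K(i) for each sequence, in enumeration order."""
    seen_prefixes, result = set(), []
    for x in xs:
        prefixes = [x[:j] for j in range(1, len(x) + 1)]
        # k(i): first position whose prefix no earlier sequence provides.
        k = next((j for j, p in enumerate(prefixes, 1)
                  if p not in seen_prefixes), None)
        # Assumption: a duplicate of an earlier sequence gets K(i) = M.
        result.append(max_len if k is None else min(k - 1, max_len))
        seen_prefixes.update(prefixes)
    return result


K = degrees_of_non_overlap(sequences, M)

# Leaf nodes of the Aggregated Log: distinct sequences that are not a
# proper prefix of another distinct sequence.
distinct = {seq for seq, _ in LOG}
LN = sum(1 for s in distinct
         if not any(o != s and o.startswith(s) for o in distinct))

pstime = sum(comb(M, n) for _ in sequences) * N ** 2
pstime_lengths = sum(comb(len(x), n) for x in sequences) * N ** 2
pstime_distinct = sum(comb(len(x), n) for x in distinct) * N ** 2
pstime_aglog = sum(comb(M - k, n) for k in K) * LN ** 2

print(pstime, pstime_lengths, pstime_distinct, pstime_aglog)
```

Only the five sequences with $K(i) \le 1$ contribute to the last sum (two with $K(i) = 0$ and three with $K(i) = 1$), giving $2 \cdot 4 + 3 \cdot 1 = 11$ bindings before the multiplication by $6^2$.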

5 Using Conventional Miners to Discover g-Rules

The nature of the g-sequence discovery problem places it between the discovery of conventional association rules and the discovery of frequent sequences. However, none of those mining models has been designed to discover g-sequences. In this section, we describe methods of using a miner conforming to one of the two models for part of the g-sequence discovery process. In the following, we denote the mining model for ASsociation Rules' discovery as "ASR" and the Sequential Patterns' discovery Model as "SPM".

5.1 Using ASR to Discover g-Sequences

A rule discovered by ASR has the form $\rho: \alpha \rightarrow b$, where $\alpha$ is a conjunction of elements or negations of elements belonging to the set $U$ and $b \in U$. The rule $\rho$ has:

• support $Support(\rho) := support(\alpha)$, defined similarly to Def. 2.
• confidence $Confidence(\rho) := confidence(b, \alpha)$, defined as the ratio $support(\alpha \wedge b) / support(\alpha)$.

Our goal is to guide the rule discovery mechanism of ASR to the construction of g-sequences.

Input log. The input file to ASR should be a set of transactions. The transaction is a sequence of the log file $L$, although, depending on the algorithm, we may need to treat a transaction as a set of timestamped or otherwise ordered entries. In both cases, events occurring more than once in the same sequence must be augmented with their occurrence number, so that they are not removed as duplicates. We still denote this input file as $L$.

A template for ASR. For g-sequence discovery, we specify a template $t = t_1 * \ldots t_n *$. Its named variables are subject to statistical thresholds of the form $support(t_i) \ge c_i$ or $confidence(t_j, t_i) \ge c'_i$, where $i < j$. For ASR, the template implies that we are looking for rules having the form $e_1 \wedge \ldots \wedge e_{k-1} \rightarrow e_k$, where $e_1, \ldots, e_n \in U$. A condition $support(t_i) \ge c_i$ on the template translates into a new condition $support(e_1 \wedge \ldots \wedge e_i) \ge c_i$, and respectively for predicates on confidence. We invoke ASR to find all groups of elements satisfying the above conditions.

From an association rule to a template solution. A rule $\rho$ produced by ASR is not necessarily a solution to the original template, not only because wildcards and restrictions on them cannot be formed, but foremost because ASR does not guarantee order: a conjunction $e_1 \wedge e_2$ is supported by the sequences where $e_1$ appears before $e_2$ and by those where $e_2$ appears before $e_1$. Hence, ASR discovers a superset of the solutions to the template. We can use this fact to invoke ASR as a preprocessing step to the g-sequence discovery phase. In particular, we can select from $L$ the sequences corresponding to the rules discovered by ASR and use this smaller log $L'$ as input to our own mechanism for g-sequence discovery, as described in sections 3 and 4.
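The order-insensitivity of a conjunction can be made concrete with a small Python sketch on the log of Fig. 1. This is not an ASR miner; the helper names `conjunction_support` and `ordered_support` are hypothetical, and supports are counted simply as the number of recorded sequences that qualify.

```python
# The log of Fig. 1 as (sequence, number of accesses) pairs.
LOG = [("abe", 8), ("bdbc", 2), ("bce", 7), ("abef", 3),
       ("adb", 10), ("bdbe", 4), ("bef", 1)]


def conjunction_support(log, items):
    """Sequences containing all items, in any order (the ASR view)."""
    return sum(count for seq, count in log
               if all(item in seq for item in items))


def ordered_support(log, first, second):
    """Sequences in which `second` occurs somewhere after `first`."""
    return sum(count for seq, count in log
               if first in seq and second in seq[seq.index(first) + 1:])


both = conjunction_support(LOG, "bd")    # b AND d, order ignored
b_then_d = ordered_support(LOG, "b", "d")
d_then_b = ordered_support(LOG, "d", "b")

# Preprocessing step: keep only the sequences behind the conjunction.
reduced_log = [(seq, count) for seq, count in LOG
               if all(item in seq for item in "bd")]

print(both, b_then_d, d_then_b, reduced_log)
```

Here the conjunction of b and d is supported by 16 sequences, but only 6 of them contain d after b, so the conjunction over-approximates the ordered pattern. The reduced log keeps 3 of the 7 distinct sequences as input for the dedicated g-sequence miner.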

Extending ASR? The ASR model shows two shortcomings for g-sequence discovery: it does not recognize ordering and, as a consequence, it cannot treat wildcards. To remedy the first shortcoming, we must either extend the miner or augment the sequences with information that reflects their ordering in a way that ASR can understand. There are simple solutions, like augmenting each element of a sequence with its position in the sequence. This however presupposes that ASR can understand a predicate requiring that the position of an element should have a larger value than the position of another element, whereby neither of the two positions can be given explicitly. To the best of our knowledge, miners of the ASR model are not designed with such functionalities in mind. More sophisticated solutions are conceivable, like extending the sequences with an encoding of the positions of their elements. Then, a rule must be extended by boolean expressions describing the position of an element in a sequence. First of all, it is still questionable whether this encoding can be exploited without explicitly restricting the positions of the elements. Moreover, the dramatic increase of rule length and the inclusion of negations make such an encoding prohibitive.

5.2 Using SPM to Discover g-Sequences

A conventional miner discovers classical sequences (i.e. trails) satisfying some statistical property, usually dominance. The principle followed by miners for sequential pattern discovery [AS95, MT96] lies in the stepwise construction of longer and longer sequences, as long as the statistical properties of the sequences being built satisfy the threshold values set by the human expert. It seems possible to use SPM as a preprocessing filter and reduce the original log into frequent sequences only. However, we may lose solutions in that way. To see why, we present a small example.

Example 10: Let $U = \{a, b, c, d\}$ be a set of elements, e.g. web pages, and $L$ be a log file with two distinct trails, abc appearing 10 times and adc appearing 5 times. For the g-sequence $g = a * c$ we can see that the support of both a and c according to Def. 7 is 15. The support computation reflects the fact that the subsequences matching a wildcard (here: b and d) are not of interest. A conventional SPM miner does not recognize wildcards. Hence, if it is instructed to find sequences with support at least 12, it will return only the sequence with the single element a. Neither b nor d has adequate support to be taken into account. So, it is not recognized that those two trails meet at c again. △

Hence, we cannot use SPM to filter out non-frequent sequences before starting with the g-sequence discovery process. However, we can use SPM for templates that have no wildcards at their beginning. In particular, consider a template $t = t_1 *[0,0]\, t_2 *[0,0] \ldots *[0,0]\, t_k * \ldots * t_n$, i.e. effectively having no wildcards prior to variable $t_k$. We can then use SPM to find all subsequences $e_1 \ldots e_k$ that satisfy the statistical constraints over $t_1, \ldots, t_k$. Those subsequences are prefixes of the sequences that may satisfy the whole template. We must therefore extract from $L$ the complete sequences containing those subsequences and build a new smaller log $L'$, on which g-sequence discovery can be performed, as described in section 4.
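Example 10 can be replayed with a minimal Python sketch. It rests on two assumptions that go beyond the text: the conventional miner is modelled as counting contiguous trails that start at the first page of a sequence, and the support of $a * c$ is taken as the number of sequences in which c occurs anywhere after a. The names `frequent_trails` and `wildcard_support` are hypothetical and do not belong to any existing miner.

```python
from collections import Counter

# Log of Example 10: trail abc recorded 10 times, adc recorded 5 times.
LOG = [("abc", 10), ("adc", 5)]


def frequent_trails(log, min_support):
    """Contiguous trails from the entry page that reach the threshold."""
    counts = Counter()
    for seq, n in log:
        for j in range(1, len(seq) + 1):
            counts[seq[:j]] += n
    return {trail: s for trail, s in counts.items() if s >= min_support}


def wildcard_support(log, first, last):
    """Sequences where `last` follows `first` after any number of pages."""
    return sum(n for seq, n in log
               if first in seq and last in seq[seq.index(first) + 1:])


trails = frequent_trails(LOG, 12)
support_a_c = wildcard_support(LOG, "a", "c")
print(trails, support_a_c)
```

Under these assumptions the trail miner keeps only the single-element trail a (support 15), because ab and ad reach supports of 10 and 5, while the g-sequence $a * c$ is supported by all 15 recorded sequences.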

5.3 When Should Conventional Miners be Used?

The procedures described for the discovery of g-rules using ASR or SPM reveal that both types of miners can discover rules, respectively sequences, which can occasionally be used as input to the actual discovery phase. The ASR model has the advantage of not being affected by the existence of wildcards and of permitting the generation of rules of any length. Its disadvantage is the insensitivity to order, so that the rules it generates are less restrictive than the desired g-sequences. The SPM model has the advantage of producing sequences and thus guaranteeing order. Its disadvantage is the insensitivity to wildcards, so that only parts of the desired sequences can be constructed.

The decision on whether ASR or SPM is more appropriate as a preprocessor depends on the types of rules desired. If the input template contains wildcards at the beginning, ASR should be used. If there is at least part of the template that consists of adjacent named variables, SPM can be used to discover the subsequences containing them.

Summary. We have shown that g-sequences cannot be discovered solely by miners based on the ASR or the SPM model. Although such miners can be invoked as preprocessors to reduce the size of the input, the g-sequence discovery itself requires a dedicated mechanism, presented in section 4. This mechanism is applied on an aggregated form of the original log file, the Aggregated Log described in section 3.

6 Conclusions

In this study, we have presented a formal model for the discovery of navigational patterns from the web. We modelled a graph navigation pattern as a generalized sequence, the g-sequence, and we introduced the notion of template as the basis for the discovery of g-sequences in a web log. For the discovery process, we have provided an analytical procedure, which we support in our web utilization miner WUM.

Given the existence of many excellent miners in the public domain and in the commercial market, we have considered the possibilities of exploiting them instead of building new software. To this purpose, we studied the capabilities of the two mining models closest to the problem at hand against the requirements posed by the problem itself. We have found that, unless substantial extensions are made to the models, g-sequences cannot be discovered. On the other hand, the existing models can assist in improving the efficiency of the dedicated miner by reducing the search space.

This study on the formal model of g-sequence discovery is only one step in the direction of assessing knowledge from the web hypernetwork. Many technical improvements are necessary, since g-sequences are much more expensive to discover than conventional association rules or sequences. Much theoretical work is also needed to produce models for quality verification and exploitation of the results. In the area of web log mining, we also see a major challenge in the proper exploitation of this knowledge by the human experts. The navigational behaviour of users in a hypernetwork is vital for the hypernetwork's survival and the fulfillment of its goals. The web has opened new opportunities in areas like distance learning, computer-supported cooperative work and information dissemination in large heterogeneous information systems. Proper exploitation of mining technology can assist in their success.

Acknowledgment. The author wishes to thank Lukas C. Faulstich (Dept. of Computer Science, FU Berlin) for numerous useful comments during the establishment of this theory and for his contribution to the WUM project. She also thanks Karsten Winkler, Diplom student at the Humboldt-University Berlin, for his work on the implementation of WUM and the design of the user interface and the persistent storage manager. Finally, she is grateful to the anonymous referees for many useful comments and for suggesting a nice example.

References

[AEF+98] Yonatan Aumann, Oren Etzioni, Ronen Feldman, Mike Perkowitz, and Tomer Shmiel. Predicting event sequences: Data mining for prefetching web-pages. Submitted to KDD'98, Mar. 1998.

[AS94] Rakesh Agrawal and Ramakrishnan Srikant. Fast algorithms for mining association rules. In VLDB, pages 487-499, 1994.

[AS95] Rakesh Agrawal and Ramakrishnan Srikant. Mining sequential patterns. In ICDE, Taipei, Taiwan, Mar. 1995.

[ATS93] Rakesh Agrawal, Tomasz Imielinski, and Arun Swami. Mining association rules between sets of items in large databases. In SIGMOD'93, pages 207-216, Washington D.C., USA, May 1993.

[BL97] Michael J.A. Berry and Gordon Linoff. Data Mining Techniques: For Marketing, Sales and Customer Support. John Wiley & Sons, Inc., 1997.

[CMS97a] Robert Cooley, Bamshad Mobasher, and Jaideep Srivastava. Grouping web page references into transactions for mining world wide web browsing patterns. Technical Report TR 97-021, Dept. of Computer Science, Univ. of Minnesota, Minneapolis, USA, June 1997.

[CMS97b] Robert Cooley, Bamshad Mobasher, and Jaideep Srivastava. Web mining: Information and pattern discovery on the world wide web. In ICTAI'97, Dec. 1997.

[CPY96] Ming-Syan Chen, Jong Soo Park, and Philip S. Yu. Data mining for path traversal patterns in a web environment. In ICDCS, pages 385-392, 1996.

[MT96] Heikki Mannila and Hannu Toivonen. Discovering generalized episodes using minimal occurrences. In KDD'96, pages 146-151, 1996.

[PE98] Mike Perkowitz and Oren Etzioni. Adaptive web pages: Automatically synthesizing web pages. Submitted to AAAI'98, 1998.

[PPR96] Peter Pirolli, James Pitkow, and Ramana Rao. Silk from a sow's ear: Extracting usable structures from the web. In CHI'96 (http://www.acm.org/sigchi/chi96/proceedings), Vancouver, Canada, April 1996.

[SA96] Ramakrishnan Srikant and Rakesh Agrawal. Mining sequential patterns: Generalizations and performance improvements. In EDBT, Avignon, France, Mar. 1996.

[SF98] Myra Spiliopoulou and Lukas C. Faulstich. WUM: A Tool for Web Utilization Analysis. In EDBT Workshop WebDB'98, Valencia, Spain, Mar. 1998. Springer Verlag. Extended version to appear in LNCS.

[SFW98] Myra Spiliopoulou, Lukas C. Faulstich, and Karsten Winkler. Discovering Interesting Navigation Patterns over Web Usage Data. Technical report, 1998.

[TG97] Linda Tauscher and Saul Greenberg. Revisitation patterns in world wide web navigation. In CHI'97, Atlanta, Georgia, Mar. 1997.

[Wan97] Ke Wang. Discovering patterns from large and dynamic sequential data. Intelligent Information Systems, 9:8-33, 1997.

[Wex96] Alan Wexelblat. An environment for aiding information-browsing tasks. In Proc. of AAAI Spring Symposium on Acquisition, Learning and Demonstration: Automating Tasks for Users, Birmingham, UK, 1996. AAAI Press.

[ZXH98] Osmar Zaïane, Man Xin, and Jiawei Han. Discovering web access patterns and trends by applying OLAP and data mining technology on web logs. In Advances in Digital Libraries, pages 19-29, Santa Barbara, CA, Apr. 1998.
