Constraint-Based Mining with Visualization of Web Page Connectivity and Visit Associations

Jiyang Chen, Mohammad El-Hajj, Osmar R. Zaïane and Randy Goebel
Department of Computing Science, University of Alberta
Edmonton, Alberta, Canada T6G 2E8
{jiyang, mohammad, zaiane, goebel}@cs.ualberta.ca
ABSTRACT

The use of association rule mining carries the attendant challenge of focusing on appropriate data subsets so as to reduce the volume of association rules produced. The intent is to heuristically identify “interesting” rules more efficiently, from less data. This challenge is similar to that of identifying “high-value” attributes within the more general framework of machine learning, where early identification of key attributes can profoundly influence the learning outcome. In developing heuristics for improving the focus of association rule mining, there is also the question of where in the overall process such heuristics are applied. For example, many such focusing methods have been applied after the generation of a large number of rules, providing a kind of ranking or filtering. An alternative is to constrain the input data earlier in the data mining process, in an attempt to deploy heuristics earlier, in the hope that early resource savings provide similar or even better mining results. Here we consider potential improvements to the problem of achieving focus in web mining, within the general task of mining web data to understand web page connectivity. Within this framework, we investigate both the articulation and the deployment of rule constraints to help achieve focus and reduce computational resource requirements, and we present novel visualization techniques to evaluate the value of our rule constraints.
Categories and Subject Descriptors
H.2.8 [Database Applications]: Data Mining; H.5.2 [User Interfaces]: Graphical User Interfaces; H.4.m [Information Systems]: Miscellaneous; I.5.2 [Design Methodology]: Pattern Analysis

General Terms
Algorithms

Keywords
Constraint-based mining, web mining, visualization
1. INTRODUCTION
Anyone with experience in searching for trends in large data volumes will understand that it is easier to find “what you’re looking for” when you know what factors contribute
to those data artifacts you seek. It is also unlikely that anyone who has deployed an association rule mining system has not considered what further guidance might be provided before re-running the system, in order to produce fewer and more “interesting,” if not more relevant, outputs. One component of our interest is identifying methods that consistently reduce the expenditure of computational resources while increasing the value of data mining results.

In addition to the problem of knowing what kind of knowledge might heuristically improve focus, reduce resource consumption and improve results during data mining, there is the problem of managing the evaluation of such methods. With typical association rule mining systems producing sets of rules well beyond the ability of any direct human interpretation, it is important to provide sensible and intuitive methods of rule set evaluation. So, in addition to our interest in determining the content and form of methods that constrain data mining input to produce improved results, we also deploy our own method of visualizing association rules, in order to provide a high-level evaluation of our methods for constraining association rule mining inputs.

We developed a web knowledge visualization and discovery system to display web graphs, and coupled that system with an algebra consisting of unary and binary operators to manipulate and create web graphs [6]. Our web graphs are 2D disk graphs of web connectivity overlaid with strata of various information related to web pages, hyperlinks and their traced navigation. Our initial prototype also comprised layers containing patterns over these graphs, mainly classes, clusters and association rules. It quickly became apparent that the sheer number of discovered association rules made it difficult to display, and hence focus on, relevant navigational patterns. Increasing the support threshold leads to overlooking rare but pertinent patterns, while reducing the support threshold leads to an “explosion” in the number of patterns. The discovery of these patterns becomes problematic not only because of the huge number of discovered rules, but also because of performance: high memory dependencies, a huge search space, and massive I/O operations. To reduce the effects of these problems we devised a new method with a fast traversal technique that drastically reduces the search space [24]. Using constraints that reduce the output size while directly discovering patterns of interest to the user is germane to the problem at hand: focusing on a subset of patterns to visualize and analyze.

In this paper, we briefly describe our visualization and
visual web data mining application WEBVIZ and explain the notion of a web graph with its layers of data. We also introduce our efficient algorithm for mining frequent itemsets while considering two types of constraints, namely anti-monotone and monotone constraints, during the mining process. Finally, we apply our algorithm to web navigational data, with constraints defined on information collected in the layers of the web graphs, to generate focused associations between web page visits. We propose an approach that overlays the discovered patterns on the web connectivity graph for in-context visualization and analysis.

The remainder of the paper is organized as follows. To explain the context of our contributions, we describe our visualization and mining system in Section 2. We detail our web graph data structure in Section 3, to clarify the various kinds of information that can accompany our web connectivity graphs. We study two types of constraints and elaborate on how they can be considered during the mining of frequent itemsets in Section 4; in the same section we introduce our constraint-based algorithm and show experimental results. We present our approach for in-context visualization of association rules and illustrate some examples in Section 5. Finally, we discuss pertinent related work in Section 6 before concluding in Section 7.
2. OVERALL ARCHITECTURE FOR DEPLOYING RULE MINING HEURISTICS
The constraint-based association rule mining and visualization approach proposed in this paper aims both at reducing computational resources and at focusing on potentially more interesting information, in order to improve the value of the discovered association rules. Our approach is based on the framework of our visual web data mining application WEBVIZ and the notion of a web graph with its data layers. In the following, we first briefly introduce the visualization system together with its object concepts, and then present the overall framework architecture.

While there are many algorithms and mining tools to discover useful patterns from web usage logs, visualization tools are important because these patterns are usually too complex and numerous to analyze and explain without their web structure context. To better understand web site structure, web usage data, navigational patterns and the consequences of possible changes and events on a web site, it is paramount to have a visualization tool that represents the discovered patterns and displays them over a well-defined and comprehensive depiction of the web structure. Our preliminary visual web mining system, WEBVIZ, is built for this purpose. Furthermore, we believe that visualization by itself is not enough and should incorporate means of data mining beyond simple interactive manipulation like zoom-in and zoom-out. Therefore, the system also uses an algebra [6] to operate on web graph objects, to highlight interesting characteristics of web data and consequently improve the discovery of new patterns and implicit useful facts in web navigational data.

Technically, WEBVIZ is designed on a client-server model. The server holds all usage data, discovered patterns and web structure data. A client connects to the server and selects meta-data description files to load data for visualization. In this way, users can personalize their own data representation without impacting others or changing the source data. After loading the visualization, the user can apply operators in an ad hoc, investigative manner, to generate more specific and meaningful graphs based on their understanding of the displayed visualization.
[Figure 1: WEBVIZ System Architecture — the web access log and the web site are preprocessed into usage data; the usage data feeds data mining methods and the WEBVIZ server, which holds the web image, knowledge-pattern information layers, web graph algebra operations and web graphs; the WEBVIZ client supports ad-hoc visual mining.]
The object we use in WEBVIZ, both for visualizing the web information and for expressing mining operations, is a web graph. It is a multi-tier object that combines all necessary data regarding the web structure, usage data and discovered patterns. The web connectivity context, called the web image, is represented by the background tier and is generated as a tree rooted at a given page, as explained in Section 5. Each other tier, called an information layer, represents either pre-processed web usage statistics, e.g., page visits, or discovered patterns, e.g., association rules. Each layer is depicted with different visual cues and can be inhibited or rendered when the object is displayed, allowing the localization of any information vis-à-vis the web site. We believe the idea of stacking data and patterns in distinct layers on top of a representation of the structure of the web is useful, because it allows the display of information in context, which is better suited for pattern discovery and interpretation.

The architecture of our WEBVIZ system is illustrated in Figure 1. Given a web site, usage data can be extracted after cleaning and filtering web access logs. A web image is then built using the structure of the site and, where appropriate, the pre-processed web access log [6]. The data collected by pre-processing is used both for creating information layers and as input to various data mining modules. The knowledge patterns, e.g., association rules, discovered by such data mining can also be converted into information layers. A web graph can then be generated with the context structure and various layers of data. In WEBVIZ, all usage data, log files and discovered patterns are stored on the server side. Users can apply several mining operations on the client side and are able to create new web graphs to highlight particular navigational patterns.
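To make the layered web graph object concrete, the following is a minimal Python sketch of how such a multi-tier structure might be represented; the class and field names are illustrative assumptions, not the actual WEBVIZ implementation.

    from dataclasses import dataclass, field

    @dataclass
    class InformationLayer:
        """One tier of a web graph: usage statistics or discovered patterns."""
        name: str                                   # e.g. "page_visits" or "association_rules"
        visible: bool = True                        # layers can be inhibited or rendered
        data: dict = field(default_factory=dict)    # node/edge id -> visual-cue values

    @dataclass
    class WebGraph:
        """Multi-tier object: a connectivity background plus optional layers."""
        web_image: dict                             # background tier: page -> linked pages
        layers: list = field(default_factory=list)  # stacked InformationLayer objects

        def rendered_layers(self):
            # only visible layers are overlaid on the web image
            return [layer for layer in self.layers if layer.visible]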
3. WEB PAGE CONNECTIVITY PROBLEM
3.1 Web Page Connectivity Data

The web usage data we mine is generated from monthly server-produced access logs of our department web site [25]. The data preparation consists of crawling the web site structure, cleaning irrelevant records, and breaking the access log into user sessions. To cut the usage records into sessions, we identify users by their authentication or cookie value, and exclude all records that carry neither an authentication nor a cookie value. The session timeout is arbitrarily set to 30 minutes. Our interest here is in log attributes that include page visit frequency, average page view time, link usage frequency, session user IP address, etc.
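As an illustration of this preparation step, here is a minimal Python sketch of the sessionization logic just described; the record fields ('user' for the authentication or cookie identifier, 'time' for the request timestamp) are assumptions for illustration, not our actual preprocessing code.

    from datetime import timedelta

    SESSION_TIMEOUT = timedelta(minutes=30)   # the timeout used in the paper

    def sessionize(records):
        """Split time-ordered, cleaned log records into user sessions.

        Records whose 'user' field is None (neither authentication nor
        cookie value) are excluded, as described above.
        """
        sessions, last_seen, current = [], {}, {}
        for rec in records:
            user = rec.get("user")
            if user is None:                  # no identifier: drop the record
                continue
            prev = last_seen.get(user)
            if prev is None or rec["time"] - prev > SESSION_TIMEOUT:
                current[user] = []            # start a new session for this user
                sessions.append(current[user])
            current[user].append(rec)
            last_seen[user] = rec["time"]
        return sessions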
3.2 Interpreting Association Rules on Web Page Connectivity

The problem of mining association rules consists of finding associations between items or itemsets in transactional data. The data is typically retail sales in the form of customer transactions, but can be any data that can be modeled as transactions. For web access data, click-stream visitation is modeled by sets of transactions. The main and most expensive component in mining association rules is the mining of frequent itemsets. Formally, the problem is stated as follows. Let I = {i1, i2, ..., im} be a set of literals, called items. Each item is an object with some predefined attributes such as size, age, duration, time, etc., and m is considered the dimensionality of the problem. Let D be a set of transactions, where each transaction T is a set of items such that T ⊆ I. A transaction T is said to contain X, a set of items in I, if X ⊆ T. A constraint ζ is a predicate on an itemset X that yields either true or false; an itemset X satisfies a constraint ζ if and only if ζ(X) is true. Constraints are typically imposed on the predefined attributes describing the items. An itemset X has support s in the transaction set D if s% of the transactions in D contain X. Two particular constraints pertain to the support of an itemset, namely the minimum support constraint and the maximum support constraint. An itemset X is said to be infrequent if its support s is smaller than a given minimum support threshold σ; X is said to be too frequent if its support s is greater than a given maximum support Σ; and X is said to be large, or frequent, if its support s is greater than or equal to σ and less than or equal to Σ. An association rule is an implication of the form “X ⇒ Y”, where X ⊆ I, Y ⊆ I, and X ∩ Y = ∅. The rule X ⇒ Y has a support s, which is the probability that X and Y hold together in D, and a confidence c, which is the conditional probability that the consequent Y is true given the antecedent X.
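These definitions translate directly into code; the following Python sketch computes support and confidence on a toy set of click-stream transactions (the session data is made up for illustration).

    def support(itemset, transactions):
        """Fraction of transactions containing every item of `itemset`."""
        itemset = set(itemset)
        return sum(1 for t in transactions if itemset <= set(t)) / len(transactions)

    def confidence(antecedent, consequent, transactions):
        """Conditional probability of Y given X: support(X ∪ Y) / support(X)."""
        return (support(set(antecedent) | set(consequent), transactions)
                / support(antecedent, transactions))

    # toy transactions: each is the set of pages visited in one session
    sessions = [{"A", "B"}, {"A", "B", "C"}, {"A", "C"}, {"B", "C"}]
    print(support({"A", "B"}, sessions))        # 0.5
    print(confidence({"A"}, {"B"}, sessions))   # 0.666...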
3.3 Evaluating Web Page Connectivity Association Rules

The association rules we create are intended to provide a summary representation of web page connectivity, based on a web site’s general topology and the usage indicated in web access logs. To know whether one set of such rules is preferred over another, we would need to construct some measure of “goodness.” In fact we understand, as mentioned in the introduction, that the quality of rules, the quality of the constraint methods applied to derive rules, and even the alternatives to visualization
are not independent. In this regard, we believe that our experiments combining informal web connectivity assessments with the visualization tool will reveal which aspects of web connectivity, constraint application, and visualization are relevant to a more precise evaluation.
4. CONSTRAINT-BASED MINING

Algorithms for discovering association rules typically generate an overwhelming number of rules. While many new, more efficient algorithms have recently been proposed for mining extremely large datasets, the problem of large numbers remains: the set of discovered rules is often so large that it has no value in identifying trends in the data from which it was mined. Various interestingness measures and filters have been proposed to reduce the volume of discovered rules, but one of the most realistic ways to find only the interesting patterns is to express constraints on the rules we want to discover. Filtering the rules post-mining, however, adds significant overhead and misses the opportunity to reduce the search space using the constraints. Ideally, constraints should be applied as early as possible during the mining process.
4.1 Categories of Constraints

Different types of constraints have been identified in the literature [14]. Here we discuss two important categories of constraints: monotone and anti-monotone.

Definition 1 (Anti-monotone constraints) A constraint ζ is anti-monotone if and only if, whenever an itemset X violates ζ, so does any superset of X. Equivalently, if ζ holds for an itemset S then it holds for any subset of S.

Many constraints fall within the anti-monotone category. The minimum support threshold is a typical anti-monotone constraint. As another example, sum(S) ≤ v (∀a ∈ S, a ≥ 0) is anti-monotone. Assume that web pages A, B, and C have average visit times of 10, 15, and 5 minutes respectively. Given the constraint ζ = (sum(S) ≤ 20 minutes), then since the page set AB, with a total average visit time of 25 minutes, violates ζ, there is no need to test any of its supersets (e.g., ABC), as they de facto also violate ζ.

Definition 2 (Monotone constraints) A constraint ζ is monotone if and only if, whenever an itemset X satisfies ζ, so does any superset of X. Equivalently, if ζ is violated for an itemset S then it is violated for any subset of S.

The maximum support threshold is a typical monotone constraint. Another example of a monotone constraint is sum(S) ≥ v (∀a ∈ S, a ≥ 0). Using the same web pages A, B, and C as before, with the constraint ζ = (sum(S) ≥ 35 minutes), knowing that ABC violates ζ is sufficient to conclude that all subsets of ABC violate ζ as well.
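Both definitions can be exercised on the web-page example above. The following Python sketch, assuming the visit times of 10, 15 and 5 minutes, shows how a single check licenses pruning in each direction.

    visit_time = {"A": 10, "B": 15, "C": 5}   # average visit time per page (minutes)

    def anti_monotone_ok(itemset, v=20):
        """sum(S) <= v over non-negative values is anti-monotone:
        once an itemset violates it, every superset violates it too."""
        return sum(visit_time[p] for p in itemset) <= v

    def monotone_ok(itemset, v=35):
        """sum(S) >= v over non-negative values is monotone:
        once an itemset satisfies it, every superset satisfies it too."""
        return sum(visit_time[p] for p in itemset) >= v

    # AB sums to 25 and violates sum(S) <= 20, so ABC need not be tested:
    assert not anti_monotone_ok({"A", "B"})
    # ABC sums to 30 and violates sum(S) >= 35, so no subset of ABC can satisfy it:
    assert not monotone_ok({"A", "B", "C"})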
4.2 Constraining Connectivity Rule Solutions

The first step in mining for constrained association rules is to select which of the existing state-of-the-art algorithms can handle the job better than the others. Our decision was to push both types of constraints early in the mining process, because this is the most efficient way to get the answers. Most proposed algorithms that handle constraints during frequent itemset mining deal either with only anti-monotone constraints, in the case of bottom-up approaches, or with only monotone constraints, in the case of top-down approaches. The only previously known algorithm that can consider monotone and anti-monotone constraints simultaneously is DualMiner [4]. However, DualMiner has significant practical limitations due to its modest performance and its lack of frequency calculation for all items. Efficiently finding the support of any itemset (even one that does not satisfy the constraints) is critical for generating association rules. For example, if itemset ABC satisfies the monotone constraints, itemset AB may not satisfy those same constraints; in such cases most algorithms would not even generate and test AB. However, when generating association rules, the support of AB is critical for finding the rules in which ABC is involved.

Our algorithm BiFold-Leap is capable of bidirectional pushing of constraints [9], and can consider both types of constraints effectively during the mining process. It also finds the support of any itemset efficiently, using a special data structure that encodes the supports of all frequent patterns. The BiFold-Leap algorithm is based on a novel search space traversal strategy called Leap-Traversal [24]. Using this strategy we can jump within the search space, looking for frequent patterns and generating a minimal set of candidate itemsets. This approach has the advantage of finding both long and short patterns easily, as opposed to bottom-up approaches, which favour short frequent patterns, or top-down approaches, which favour long frequent patterns. Our approach simply jumps within the search space from short to long candidates to generate a set of special patterns called maximals: a frequent itemset is said to be maximal if no other frequent itemset subsumes it. The approach also quickly generates the support of all the subsets of these maximal patterns.

Simply put, the Leap approach first creates an FP-tree [10]. From this prefix tree of frequent sub-transactions, unique sub-transactions, called frequent path bases (FPB), and their counts, called branch supports, are obtained. Infrequent FPBs are intersected iteratively, producing subsets which may be frequent. When an intersection of infrequent FPBs results in a frequent itemset, that itemset is declared a maximal candidate and is not subject to further intersections. When an intersection is infrequent, it participates in further intersections looking for maximals. This is how the leaps in the lattice are done: the result of the intersection of FPBs indicates the next node to explore. Support counting for the frequent itemsets is done by summing the branch supports of all FPBs that are supersets of the pattern.

We extended this leap approach by pushing constraints during the jumping process, in such a way that if a tested node does not satisfy the anti-monotone constraints then we do not jump upward any more, as none of its supersets satisfy those anti-monotone constraints. The same idea is applied for monotone constraints, except that we do not jump downward, because all the subsets of an itemset that violates a monotone constraint also violate it. The conjunction of all anti-monotone constraints comprises a predicate that we call P(). A second predicate, Q(), is the conjunction of the monotone constraints.
Pushing P() and Q() starts by defining two terms, the head (H) and the tail (T), where H is a frequent path base or any subset generated from the intersection of frequent path bases, and T is the itemset generated by intersecting all remaining frequent path bases not used in the intersection of H. The intersection H ∩ T is the smallest subset of H that may yet be considered. In this way, Leap focuses on finding frequent heads H that can be declared local maximals and candidate global maximals. The algorithm is also capable of applying fewer predicate checks: if it detects that a candidate itemset satisfies the anti-monotone constraints, it does not test all its subsets, as they all satisfy the constraints; the same idea is applied in reverse for the monotone constraints. Four pruning strategies are used to reduce the intersections between nodes (a sketch of the first three follows this list):

1. If an intersection of frequent path bases (H) fails Q(), then H can be discarded.

2. If an intersection of frequent path bases (H) passes P(), it is a candidate P-maximal (i.e., a local maximal that satisfies P()), and there is no need to evaluate further intersections with H.

3. If a node’s H ∩ T fails P(), the H node can be discarded, and there is no need to evaluate further intersections with H.

4. If a node’s H ∩ T passes Q(), then Q() is guaranteed to pass for any itemset resulting from the intersection of a subset of the frequent path bases used to generate H with the remaining frequent path bases yet to be intersected with H; Q() does not need to be checked in these cases.

Ultimately, the generated set of maximals satisfies both types of constraints and no relevant maximal is missed. Generating all subsets from this set, and then the relevant association rules, is the last step in the mining process. Only Q() needs to be checked whenever a frequent subset is generated from a maximal: if a given itemset violates Q(), none of its subsets are generated further.
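The following Python sketch illustrates how checks 1–3 can drive a leap-style traversal over frequent path bases. It is a didactic approximation under simplifying assumptions (no duplicate-node pruning; check 4, the saved Q() evaluations, is omitted), not the actual BiFold-Leap implementation.

    from functools import reduce

    def pattern_support(pattern, fpbs):
        """Support of a pattern: the sum of the branch supports of all
        frequent path bases (FPBs) that are supersets of it."""
        return sum(count for base, count in fpbs if pattern <= base)

    def constrained_leap(fpbs, min_sup, P, Q):
        """fpbs: list of (frozenset_of_pages, branch_support) pairs.
        P / Q: conjunctions of anti-monotone / monotone constraints,
        each a predicate over an itemset. Returns candidate maximals."""
        maximals = []
        # a node is a head H plus the FPBs not yet intersected into it
        stack = [(base, [b for b, _ in fpbs if b != base]) for base, _ in fpbs]
        while stack:
            head, remaining = stack.pop()
            if not Q(head):                  # check 1: H fails Q -> discard
                continue
            tail = reduce(frozenset.intersection, remaining, head)
            if not P(tail):                  # check 3: H ∩ T fails P -> discard
                continue
            if P(head) and pattern_support(head, fpbs) >= min_sup:
                maximals.append(head)        # check 2: candidate P-maximal
                continue                     # no further intersections with H
            for i, other in enumerate(remaining):   # leap to the next nodes
                new_head = head & other
                if new_head:
                    stack.append((new_head, remaining[:i] + remaining[i + 1:]))
        # keep only maximals not subsumed by a longer maximal
        return [m for m in maximals if not any(m < o for o in maximals)]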
4.3 Experiments in Constraining Association Rule Mining

To illustrate the effectiveness of our approach, we report experiments on two datasets: the web requests for the months of March and April for our departmental web site. The experiments depicted herein are run with a minimum support of 0.01% and a minimum confidence of 1%. Using these two thresholds, the March and April datasets generated 10,985 and 13,543 association rules respectively, with no other constraint. Visualizing all of these rules for analysis is obviously impractical, so constraints were then added for better focus. Figures 2 and 3 present the number of rules generated without constraints and with some specific constraints; Table 1 presents the monotone and anti-monotone constraints used in our experiments. We first report the reduction obtained by a rule size constraint: forcing the rule size to be between 3 and 6 pages decreases the number of rules to 6,754 and 4,681 respectively for March and April. We further applied constraints on two web page attributes, namely the number of visits per page and the average visit time per page. To examine the selectivity of individual constraints, we applied the constraints sum(), max() and min() separately, both monotonically and anti-monotonically.
Table 1: Monotone and anti-monotone constraints used in our experiments

    monotone:       min(S) ≤ v;  max(S) ≥ v;  count(S) ≥ v;  sum(S) ≥ v (∀a ∈ S, a ≥ 0);  support(S) ≤ v
    anti-monotone:  min(S) ≥ v;  max(S) ≤ v;  count(S) ≤ v;  sum(S) ≤ v (∀a ∈ S, a ≥ 0);  support(S) ≥ v

[Figure 2: Number of patterns on March dataset, Support = 0.01%, Confidence = 1% — bar chart of the number of generated association rules for each constraint set (support only; support and rule size; support and sum(); support and max(); support and min(); all constraints), with separate bars for constraints applied on visit number, on average visit time, and on both.]

[Figure 4: Association Rule Visualization in Context]
For Figures 2 and 3, the constraints are as follows: the sum of average visit times over all nodes in a rule is set between 120 and 1,000 seconds, and the sum of average visit numbers over all pages in a rule is set between 1,000 and 5,000; the maximum average visit time for a single node is limited to between 30 and 240 seconds, and the maximum average visit number to between 1,000 and 2,000; the minimum average visit time for a single page is set to between 5 and 60 seconds, and the minimum average visit number to between 10 and 200.
5. ASSOCIATION RULE VISUALIZATION
[Figure 3: Number of patterns on April dataset, Support = 0.01%, Confidence = 1% — bar chart analogous to Figure 2, for the April dataset.]

After discovering association rules using our constraint-based mining method, displaying those rules within the web page connectivity context becomes essential, to take advantage of human visual capabilities in identifying valuable knowledge patterns.
In this section, we present our visualization approach for analyzing association rules in their structural context, by overlaying discovered patterns on the web connectivity graph. We also show our preliminary work on comparing and visualizing association rules of different time periods.

We adopt the disk-tree [7] representation to generate the web image, in which a node represents a page and an edge indicates the hyperlink connecting two pages; the root, a node located in the center of the graph, is usually the home page of a site or a given starting page. A disk-tree is a concentric circular hierarchical layout with the root node in the middle of the circles. The nodes in the outer circles are children of the inner circles’ nodes, meaning that the pages corresponding to outer nodes can be reached from those of the inner nodes. We use Breadth First Search (BFS) to convert the connected graph into a tree structure, placing each node as close to the root as possible. For each page, only one incoming link can be represented, and BFS picks the first link encountered while scanning the web topology. A usage-based disk-tree is also possible [6].

Figure 4 is an example of our association rule visualization. It shows rules whose support > 0.2%, within two levels of the disk-tree, for one month’s usage records of our department web site. In the graph, association rules are overlaid as an information layer on the web site topology context, using visual cues such as edge colours and sized arrow heads. For each rule, one colour is used to draw both the antecedent and consequent nodes. The size of a node represents the visit frequency of the corresponding page. The colour of an edge indicates the confidence of the rule, and the size of the arrow head represents the support of the rule. For rules that have multiple antecedent or consequent items, edges are drawn between every pair of involved pages and display the same values.
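The BFS conversion just described fits in a few lines; the following Python sketch assumes the site structure is given as an adjacency dictionary, which is an illustrative representation rather than our actual data format.

    from collections import deque

    def bfs_disk_tree(links, root):
        """Convert the connectivity graph into a disk-tree hierarchy.

        links: dict mapping each page to the list of pages it links to.
        Returns (parent, depth): one incoming link per page (the first
        one found by BFS) and each page's ring index in the disk-tree.
        """
        parent, depth = {root: None}, {root: 0}
        queue = deque([root])
        while queue:
            page = queue.popleft()
            for child in links.get(page, []):
                if child not in parent:      # keep only the first incoming link
                    parent[child] = page
                    depth[child] = depth[page] + 1   # outer rings are deeper levels
                    queue.append(child)
        return parent, depth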
There are four types of rules visualized within the web graph context.

• Parent-Child rule (Rules A in Figure 4). Parent-Child is the normal type of web association rule: the visit of the parent page presumably leads the user to the child pages. Most of the displayed rules are of this type.

• Child-Parent rule (Rules B in Figure 4). Child-Parent rules are also common, since many pages have links back to their parent page.

• Cross-Level rule (Rules C in Figure 4). Cross-Level rules show possible design flaws that deserve further investigation. Cross-Level means the user jumps to a descendant of a page after viewing that page, without visiting the pages between the two ends. A Cross-Level rule with high support and confidence means there are popular ways other than the designed path for users to reach the destination page; e.g., many professors put a link to the graduate application on their home pages, and these links are heavily used.

• Cross-Tree rule (Rules D in Figure 4). Cross-Tree means the user jumps from one subtree of the web graph to another. Cross-Tree rules also potentially indicate design problems and thus merit additional examination. A Cross-Tree rule with high support and confidence tells us that other connections into the destination page are more popular than the designed one; e.g., two paths lead to a course’s home page, one from the department course catalog and the other from a faculty member’s home page. The catalog path is represented as a normal rule in the graph, while the other appears as a Cross-Tree rule and is much more heavily used.

Web administrators should look into Cross-Level and Cross-Tree rules and check the visualized usage behaviour so as to improve the site structure (a small sketch classifying these four rule types is given below). Our constraint-based mining method assists in reducing the number of association rules, i.e., the number of rules to visualize. However, to better focus on the more relevant patterns, one should also be able to select rules to visualize by their features, e.g., their antecedent or consequent parts. Through such rule selection, a web administrator can concentrate on one specific page and analyze the related rules that represent the navigational usage of that page. In Figure 5, the left graph shows rules that share the same antecedent part, revealing the pages most visited after viewing the antecedent page; the right graph shows rules that share the same consequent part, identifying the pages most frequently navigated before viewing the consequent part (i.e., popular referrers).

[Figure 5: Association Rule Selection — the left graph selects rules with a common antecedent page; the right graph selects rules with a common consequent page.]
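For illustration, the four rule types above can be recognized from the disk-tree's parent relation. The following Python sketch, reusing the parent map from the BFS sketch, classifies a one-antecedent, one-consequent rule; it is a simplification (rules with multiple items, or jumps toward an ancestor, would need extra cases).

    def ancestors(page, parent):
        """All pages on the path from `page` up to the disk-tree root."""
        chain = []
        while parent.get(page) is not None:
            page = parent[page]
            chain.append(page)
        return chain

    def rule_type(antecedent, consequent, parent):
        """Classify a one-to-one rule against the disk-tree (illustrative)."""
        if parent.get(consequent) == antecedent:
            return "Parent-Child"
        if parent.get(antecedent) == consequent:
            return "Child-Parent"
        if antecedent in ancestors(consequent, parent):
            return "Cross-Level"             # skips intermediate levels
        return "Cross-Tree"                  # antecedent lies in another subtree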
[Figure 6: Association Rule Evolution Comparison — the March graph and the April graph are combined into a compare graph.]
An analyst can take advantage of the web topology structure in the background to better understand possible contextual relationships among the pages involved in the visualized rules.

While investigating association rule graphs based on one web usage and connectivity dataset explains user behaviour in that particular time slot, we can also manipulate graphs from different datasets, separated by time periods, to visualize the usage difference between comparable time slices. For instance, it is interesting to check which rules remain popular across time, which rules used to be important but fade out, and which rules have just appeared. For example, in the last month of the semester we can expect rules representing the connection between a course home page and its project description page to appear, and rules relating the course registration page to the catalog page to fade out. In Figure 6, the difference operator is applied to two months’ rules, whose support > 0.2% and which have the same antecedent page in the two consecutive datasets. In the difference graph, green lines indicate “fading out” rules and red lines represent rules that remain popular or have become even more heavily used. The size of the arrow head is determined by the difference between the support values of the same rule in the two time periods. We are not limited to difference comparisons: with similar operations and the same visualization process, we can distinguish similarity, appearance and disappearance of navigational behaviour, etc., to better understand and explain the evolution of association rules within the web usage context.

We believe our approach of layering association rules on top of a web topology graph for in-context visualization and analysis is better suited for the interpretation of discovered patterns,
e.g., it allows us to explain association rule patterns in a more structural way as discussed above. The few operations we suggest on the web graph with association rules are powerful and can be extended for further web mining and pattern analysis.
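As an illustration of the difference operator on rule layers, the following Python sketch compares the rule sets of two periods and assigns the colour and arrow-head cues described above; the input representation (rules mapped to their supports) is an assumption for illustration.

    def rule_difference(rules_old, rules_new, min_sup=0.002):
        """Compare two periods' rules (dicts mapping an
        (antecedent, consequent) pair to its support). Returns edges
        coloured green for fading rules and red for persisting ones,
        with arrow-head size driven by the change in support."""
        edges = []
        for rule, sup_old in rules_old.items():
            if sup_old < min_sup:            # below the display threshold
                continue
            sup_new = rules_new.get(rule, 0.0)
            colour = "red" if sup_new >= min_sup else "green"
            edges.append((rule, colour, abs(sup_new - sup_old)))
        return edges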
6. RELATED WORK
Visualizations of association rules can be viewed as depictions of many (or one) to many (or one) relationships among various information items. For years, researchers have developed tools to help people analyze and understand association rules. Some of these tools have already been built into commercial data mining packages such as SGI MineSet [18], DBMiner [8], and DB2 Intelligent Miner [13]. Currently, there are two prevailing approaches to visualizing association rules.

The first approach, the two-dimensional (2D) matrix, is quite straightforward. The basic design is to position the antecedent and consequent items on different axes of a two-dimensional matrix; a rule is then shown in the corresponding cell by an icon whose visual cues represent data such as the support and confidence values. Not only is the 2D matrix representation effective for showing one-to-one relationships, it also works for many-to-many rules, by grouping all antecedent (consequent) items of a rule together as one unit along the axes [18]. However, when the antecedent (consequent) sets become large, the combination of items creates a much larger matrix, which is no longer practical for visualization. Wong et al. [22] propose an alternative technique that depicts the rule-to-item relationship instead of item-to-item: they use columns for rules and rows for items, and each cell represents a rule-to-item relationship (antecedent, consequent, or no relation). The confidence and support values are visualized in three-dimensional space by a bar chart placed in the last row. This approach handles large itemsets well, but does not scale to a large set of rules.

The directed graph is another popular technique to visualize item associations. The nodes of a directed graph represent the items, and the edges represent associations. The technique works well for a small set of items and associations, but the graph can quickly turn into a mess with as few as a dozen rules, especially when the context structure is not well defined. Hetzler et al. [11] address this problem by animating the edges to show the associations of certain items with 3D arcs.

Although visualization tools [12, 22] allow us to obtain an overview of all discovered rules, they are still inadequate, since these tools encounter problems displaying a huge number of rules, most of which are not interesting at all. Meanwhile, pruning methods [16] help remove redundant rules but still require graphical interpretation. Liu et al. [15, 17] propose a way to prune and visualize association rules by allowing users to specify their existing knowledge, which is then used to analyze the discovered rules based on various “interestingness” criteria and, through such analysis, to select potentially interesting rules. A recent study by Yang [23] proposed using parallel coordinates to visualize frequent itemsets and many-to-many association rules. None of the above approaches deals with visualizing association rules in their web structure context, or with the possibility of applying operators on the visualization itself.

Mining frequent patterns with constraints has been studied in [14], where the concepts of monotone, anti-monotone and succinct constraints were introduced to prune the search space. Jian Pei et al. [19, 20] have generalized these two classes of constraints and introduced a new class of convertible constraints. In their work they proposed a new algorithm, FICM, which is FP-Growth based [10]. This algorithm generates most frequent patterns before pruning them. Its main contribution is that it checks monotone constraints early: once a frequent itemset is found to satisfy the monotone constraint, all subsequent itemsets having it as a prefix are certain to satisfy the constraint, and consequently no further checks need to be applied. DualMiner [4] is the first algorithm to mine with both types of constraints at the same time. Nonetheless, it suffers from many practical limitations and performance issues. First, it is built on top of the MAFIA [5] algorithm, which produces the set of maximal patterns without retaining sufficient information about item supports; consequently, the frequent patterns generated using this model do not have their supports attached. Second, it assumes that the whole dataset fits in main memory, which is not always the case. Third, its top-down computation exploiting the monotone constraint often performs many useless tests on relatively large datasets, which raises doubts about the performance gained by pushing constraints in DualMiner. In a recent study parallelizing DualMiner [21], the authors showed that on relatively small sparse datasets of 10K transactions and 100K items, the sequential version of DualMiner took an excessive amount of time; the original DualMiner authors did not report any experiment on the execution time of their algorithm, only the reduction in predicate executions [4]. A recent strategy for dealing with monotone and anti-monotone constraints suggests reducing the transactional database input via preprocessing, by successively eliminating transactions that violate the constraints, and then applying any frequent itemset mining algorithm on the reduced transaction set [1, 3]. The main drawback of this approach is that it is highly I/O bound, due to the iterative process of re-writing the reduced dataset to disk. The approach is also sensitive to the result of the initial monotone constraint check, which is applied to full transactions: if a whole transaction satisfies the monotone constraint, no pruning is applied, and consequently no gains are achieved even if parts of the transaction do not satisfy the same constraint. To overcome some of the issues in [1], the same approach has been tested with the FP-Growth approach in [2], with new effective pruning heuristics.
7. SUMMARY AND CONCLUSIONS

Discovering trends and patterns in web navigation is undeniably advantageous to web designers and web-based application architects. Associations between web pages, their visits and their connectivity, articulated in terms of association rules, are such useful web patterns. Visualizing the rules that express web page associations and their visits in the context of the web site structure is of major importance, as it puts web page requests and their connectivity in perspective. Alas, the massive number of association rules typically discovered prevents any practical visualization, and thus interpretation, of discovered patterns, with or without the web structure context. What is needed is a means to effectively and efficiently focus on relevant association rules, in order to visualize pertinent relations between visited pages.
We proposed a framework for mining web data to understand web page access behaviour vis-à-vis a given connectivity. In essence, we propose pattern filtering at three levels to reach a practical and reasonably sized set of interpretable association rules. The first level consists of concentrating on only the relevant rules by means of constraint-based association rule mining; it is based on expressing specific constraints on web pages and visits to delimit the patterns of interest. For this most selective level we devised an efficient approach that considers constraints, whether monotone or anti-monotone, during the mining process. The second level of filtering is interactive and consists of putting restrictions on either the antecedent or the consequent of a rule, to focus on rules that start at or lead to a given web page. The third level of filtering is based on our web graph algebra: operating on web graphs representing different time periods, the algebra allows us to converge on evolving association rules and hint at the dynamics of web navigational behaviour. The evaluation of the visualization is premature. However, visualizing the association rules together with, or more exactly overlapping, the web structure is convincingly advantageous in assisting the interpretation of web navigational behaviour as well as the assessment of the web structure’s effectiveness.
8. ACKNOWLEDGMENTS
Our work is supported by the Canadian Natural Sciences and Engineering Research Council (NSERC), by the Alberta Ingenuity Centre for Machine Learning (AICML), and by the Alberta Informatics Circle of Research Excellence (iCORE).
9. REFERENCES
[1] F. Bonchi, F. Giannotti, A. Mazzanti, and D. Pedreschi. ExAMiner: Optimized level-wise frequent pattern mining with monotone constraints. In IEEE ICDM, Melbourne, Florida, November 2003.
[2] F. Bonchi and B. Goethals. FP-Bonsai: The art of growing and pruning small FP-trees. In Pacific-Asia Conference on Knowledge Discovery and Data Mining (PAKDD), pages 155–160, 2004.
[3] F. Bonchi and C. Lucchese. On closed constrained frequent pattern mining. In IEEE ICDM, Brighton, UK, November 2004.
[4] C. Bucila, J. Gehrke, D. Kifer, and W. White. DualMiner: A dual-pruning algorithm for itemsets with constraints. In ACM SIGKDD Conference, pages 42–51, Edmonton, Alberta, August 2002.
[5] D. Burdick, M. Calimlim, and J. Gehrke. MAFIA: A maximal frequent itemset algorithm for transactional databases. In IEEE ICDE, pages 443–452, 2001.
[6] J. Chen, L. Sun, O. R. Zaïane, and R. Goebel. Visualizing and discovering web navigational patterns. In 7th ACM SIGMOD International Workshop on the Web and Databases (WebDB), pages 13–18, June 2004.
[7] E. H. Chi, J. Pitkow, J. Mackinlay, P. Pirolli, R. Gossweiler, and S. K. Card. Visualizing the evolution of web ecologies. In Proceedings of the Conference on Human Factors in Computing Systems (CHI’98), 1998.
[8] DBMiner: http://www.dbminer.com/.
[9] M. El-Hajj, O. R. Zaïane, and P. Nalos. BiFold constraint-based mining by simultaneous monotone and anti-monotone checking. In IEEE ICDM, Houston, TX, November 2005.
[10] J. Han, J. Pei, and Y. Yin. Mining frequent patterns without candidate generation. In ACM SIGMOD, 2000.
[11] B. Hetzler, W. M. Harris, S. Havre, and P. Whitney. Visualizing the full spectrum of document relationships. In Proceedings of the Fifth International Society for Knowledge Organization (ISKO) Conference, pages 168–175, 1998.
[12] H. Hofmann, A. P. J. M. Siebes, and A. F. X. Wilhelm. Visualizing association rules with interactive mosaic plots. In ACM SIGKDD Conference, pages 227–235, 2000.
[13] Intelligent Miner: http://www-306.ibm.com/software/data/iminer/.
[14] L. Lakshmanan, R. Ng, J. Han, and A. Pang. Optimization of constrained frequent set queries with 2-variable constraints. In ACM SIGMOD Conference on Management of Data, pages 157–168, 1999.
[15] B. Liu, W. Hsu, S. Chen, and Y. Ma. Analyzing the subjective interestingness of association rules. IEEE Intelligent Systems, 15(5):47–55, 2000.
[16] B. Liu, W. Hsu, and Y. Ma. Pruning and summarizing the discovered associations. In Knowledge Discovery and Data Mining, pages 125–134, 1999.
[17] B. Liu, W. Hsu, K. Wang, and S. Chen. Visually aided exploration of interesting association rules. In Pacific-Asia Conference on Knowledge Discovery and Data Mining (PAKDD), pages 380–389, 1999.
[18] MineSet: http://www.purpleinsight.com/products/index.shtml.
[19] J. Pei and J. Han. Can we push more constraints into frequent pattern mining? In ACM SIGKDD Conference, pages 350–354, 2000.
[20] J. Pei, J. Han, and L. Lakshmanan. Mining frequent itemsets with convertible constraints. In IEEE ICDE Conference, pages 433–442, 2001.
[21] R. M. Ting, J. Bailey, and K. Ramamohanarao. ParaDualMiner: An efficient parallel implementation of the DualMiner algorithm. In Pacific-Asia Conference on Knowledge Discovery and Data Mining (PAKDD), pages 96–105, Sydney, Australia, May 2004.
[22] P. C. Wong, P. Whitney, and J. Thomas. Visualizing association rules for text mining. In INFOVIS, pages 120–123, 1999.
[23] L. Yang. Pruning and visualizing generalized association rules in parallel coordinates. IEEE Transactions on Knowledge and Data Engineering, 17(1):60–70, 2005.
[24] O. R. Zaïane and M. El-Hajj. Pattern lattice traversal by selective jumps. In ACM SIGKDD Conference, pages 729–735, August 2005.
[25] T. Zheng, Y. Niu, and R. Goebel. WebFrame: In pursuit of computationally and cognitively efficient web mining. In Pacific-Asia Conference on Knowledge Discovery and Data Mining (PAKDD), pages 264–275, 2002.