A Hybrid Approach To Web Usage Mining
Authors: Søren E. Jespersen Jesper Thorhauge Torben Bach Pedersen
Technical Report 02-5002 Department of Computer Science Aalborg University Created on July 17, 2002
A Hybrid Approach To Web Usage Mining Søren E. Jespersen Jesper Thorhauge Torben Bach Pedersen Department of Computer Science, Aalborg University {sej,jespert,tbp}@cs.auc.dk
Abstract

With the large number of companies using the Internet to distribute and collect information, knowledge discovery on the web, or web mining, has become an important research area. Web usage mining, which is the main topic of this paper, focuses on knowledge discovery from the clicks in the web log for a given site (the so-called click-stream), especially on analysis of sequences of clicks. Existing techniques for analyzing click sequences have different drawbacks, i.e., either huge storage requirements, excessive I/O cost, or scalability problems when additional information is introduced into the analysis. In this paper we present a new hybrid approach for analyzing click sequences that aims to overcome these drawbacks. The approach is based on a novel combination of existing approaches, more specifically the Hypertext Probabilistic Grammar (HPG) and Click Fact Table approaches. The approach allows for additional information, e.g., user demographics, to be included in the analysis without introducing performance problems. The development is driven by experiences gained from industry collaboration. A prototype has been implemented and experiments are presented that show that the hybrid approach performs well compared to the existing approaches. This is especially true when mining sessions containing clicks with certain characteristics, i.e., when constraints are introduced. The approach is not limited to web log analysis, but can also be used for general sequence mining tasks.
1 Introduction

With the large number of companies using the Internet to distribute and collect information, knowledge discovery on the web, or web mining, has become an important research area. Web mining can be divided into three areas, namely web content mining, web structure mining, and web usage mining (also called web log mining) [11]. Web content mining focuses on discovery of information stored on the Internet, e.g., as done by the various search engines. Web structure mining can be used when improving the structural design of a web site. Web usage mining, the main topic of this paper, focuses on knowledge discovery from the usage of individual web sites. Web usage mining is mainly based on the activities recorded in the web log, the log file written by the web server recording individual requests made to the server. An important notion in a web log is the existence of user sessions. A user session is a sequence of requests from a single user within a certain time window. Of particular interest is the discovery of frequently performed sequences of actions by web users, i.e., frequent sequences of visited web pages.

The work presented in this paper has been motivated by collaboration with the Zenaria company [25]. Zenaria is in the business of creating interactive stories, told through a series of video-sequences. The story is formed by a user first viewing a video-sequence and then choosing between some predefined options, based on the video-sequence. Depending on the user's choice, a new video-sequence is shown and new options are presented. The choices of a user will form a complete story - a walkthrough - reflecting the choices made by the individual user. Traditionally, stories have been distributed on CD-ROM and a consultant has evaluated the walkthroughs by observing the users. However, Zenaria now wants to distribute their stories using the Internet, i.e., a walkthrough will correspond to a web session. Thus, the consultants will now have to use web usage mining technology to evaluate the walkthroughs.

Much work has been performed on extracting various pattern information from web logs, and the applications of the discovered knowledge range from improving the design and structure of a web site to enabling companies to provide more targeted marketing. One line of work features techniques for working directly on the log file [11, 12]. Another line of work concentrates on creating aggregated structures of the information in the web log [21, 17]. The
Hypertext Probabilistic Grammar (HPG) framework [5], utilizing the theory of grammars, is such an aggregated structure. Yet another line of work focuses on using database technology in the clickstream analysis [1, 8], building so-called "data webhouses" [14]. Several database schemas have been suggested, e.g., the click fact star schema where the individual click is the primary fact [14]. Several commercial tools for analyzing web logs exist [24, 20, 22], but their focus is mostly on statistical measures, e.g., most frequently visited pages, and they provide only limited facilities for clickstream analysis. Finally, a prominent line of work focuses on mining sequential patterns in general sequence databases [2, 3, 17, 18]. However, all the mentioned approaches have inherent weaknesses in that they either have huge storage requirements, slow performance due to many scans over the data, or problems when additional information, e.g., user demographics, is introduced into the analysis.

In this paper we present a new hybrid approach for analyzing click sequences that aims to overcome these drawbacks. The approach is based on a novel combination of existing approaches, more specifically the Hypertext Probabilistic Grammar (HPG) [4] and Click Fact Table [14] approaches. The new approach attempts to utilize the quick access and the flexibility with respect to additional information of the click fact table, and the capability of the HPG framework to quickly mine rules from large amounts of data. Specialized information is extracted from the click fact schema and presented using the HPG framework. The approach allows for additional information, e.g., user demographics, to be included in the analysis without introducing performance problems. A prototype has been implemented and experiments are presented that show that the hybrid approach performs very well compared to the existing approaches. This is especially true when mining sessions containing clicks with certain characteristics, i.e., when constraints are introduced. The approach is not limited to web log analysis, but can also be used for general sequence mining tasks. We believe this paper to be the first to present an approach for mining frequent sequences in web logs that at the same time provides small storage requirements, very fast rule mining performance, and the ability to introduce additional information into the analysis with only a small performance penalty. This is done by exploiting existing data warehouse technology.

The paper is organized as follows. Section 2 describes the techniques on which we base our new hybrid approach. Section 3 describes the hybrid approach in detail. Section 4 describes the prototype implementation. Section 5 examines the performance of the hybrid approach. Section 6 concludes and points to future work. Appendix A presents the case in question, describing the problem domain. Appendix B provides further details on the performance experiments. Appendix C presents the details of the database schemas. The paper can be read and understood without reading the appendices.
2 Background

This section describes the approaches underlying our hybrid approach, namely the Click Fact, Session Fact, and Subsession Fact schemas, and the Hypertext Probabilistic Grammar (HPG) approach. Each approach has both strong and weak points, which are briefly described.
2.1 Database-Based Approaches

The click fact schema uses the individual click on the web site as the essential fact in the data warehouse [14]. This preserves most of the information found in a web log in the data warehouse. In our case, the click fact schema contains the following dimensions: URL dimension (the web pages; note that the fact table has references to both the requested page and the referring page, also known as the "referrer"), Date dimension, TimeOfDay dimension, Session dimension, and Timespan dimension. We could easily add any desired additional information as extra dimensions, e.g., a User dimension capturing user demographics. An example Click Fact schema can be found in Appendix C.

Pros: A strong point of this approach is that almost all information from the web log is retained within the data warehouse. Individual clicks within the individual sessions can be tracked and very detailed click information is available. This approach can utilize existing OLAP techniques such as pre-aggregation to efficiently extract knowledge on individual pages.
Cons: When querying for sequences of clicks using the click fact schema, several join and self-join operations are needed to extract even relatively short sequences [1], severely degrading performance for large fact tables.

The Session Fact schema uses the entire session as the primary fact in the data warehouse [14], thereby ignoring the individual clicks within a session. Only the requests for the start and end page are stored as field values on the individual session fact entry. The approach is therefore not suitable for querying for detailed information about sequences of clicks or even individual clicks in a session. Queries on entire sessions are, however, quite efficient [1]. An example Session Fact schema can be found in Appendix C. We will not present strong and weak points for this approach, since it is unable to perform analysis of sequences of clicks, which is a key requirement for us.

The Subsession Fact schema was first proposed by Andersen et al. [1] and is aimed specifically at clickstream analysis on sequences of clicks. The approach introduces the concept of using subsessions as the fact, that is, explicitly storing all possible subsequences of clicks from a session in the data warehouse. This means that for all sessions, subsessions of all lengths are generated and stored explicitly in the data warehouse. In our case, the subsession fact schema has the dimensions Date, Session, TimeOfDay, Timespan, as well as the URLSequence dimension capturing the corresponding sequence of web page requests. An example Subsession Fact schema can be found in Appendix C.

Pros: Storing subsessions explicitly in the database instead of implicitly (through join operations) allows for better performance on queries concerning sequences of clicks. There is a tradeoff between the amount of storage used and query performance, but if a web site has relatively short sessions, the storage overhead is manageable [1]. This technique can answer sequence-related queries much faster than, e.g., the click fact schema [1].

Cons: As hinted above, the longer the sessions, the more storage is needed for storing the information for all subsessions. The number of subsessions generated from a session of length n can be estimated as n(n-1)/2 [1]. A study shows that the storage overhead can vary between a factor of 4 and a factor above 20, depending on the characteristics of the sessions on a web site [1]. Some methods for reducing the number of subsessions that are to be stored have been suggested, including ignoring either very short or very long subsessions or ignoring certain subsessions.
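To make the tradeoff concrete, the sketch below shows a minimal click fact table and the kind of self-join query needed to find a three-click sequence. The table layout and the ordering column number_in_session are assumptions for illustration only; the actual schemas used in this paper are given in Appendix C.

```sql
-- Minimal sketch of a click fact table (column names partly assumed, see Appendix C).
CREATE TABLE click_fact (
  url_key           INTEGER NOT NULL,  -- requested page, references url_dimension
  referer_key       INTEGER NOT NULL,  -- referring page, references url_dimension
  session_key       INTEGER NOT NULL,  -- references session_dimension
  date_key          INTEGER NOT NULL,  -- references date_dimension
  timeofday_key     INTEGER NOT NULL,  -- references timeofday_dimension
  timespan_key      INTEGER NOT NULL,  -- references timespan_dimension
  number_in_session INTEGER NOT NULL   -- assumed column giving the click's position in its session
);

-- Sessions containing the page sequence 101 -> 102 -> 103: already a length-3
-- sequence requires two self-joins, which is the weakness noted above.
SELECT DISTINCT c1.session_key
FROM click_fact c1
JOIN click_fact c2 ON c2.session_key = c1.session_key
                  AND c2.number_in_session = c1.number_in_session + 1
JOIN click_fact c3 ON c3.session_key = c1.session_key
                  AND c3.number_in_session = c2.number_in_session + 1
WHERE c1.url_key = 101 AND c2.url_key = 102 AND c3.url_key = 103;
```

Each additional click in the sought sequence adds another self-join, which is why the subsession fact schema trades storage for query speed.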
2.2 Hypertext Probabilistic Grammars

The nature of web sites, web pages and link navigation has a nice parallel in formal language theory, which proves rather intuitive and presents a model for extracting information about user sessions. The model uses a Hypertext Probabilistic Grammar (HPG) [15] that rests upon the well-established theoretical area of languages and grammars. We will present this parallel using the example in Figure 2.1.

Figure 2.1: Example of a web site structure (pages A1-A6 and the links between them).

The figure shows a number of web pages and the links that connect them. As can be seen, the structure is very similar to a grammar with a number of states and a number of productions leading from one state to another. It is this parallel that the model exploits. The model uses all web pages as states (this is only true if the HPG is created with a so-called "history depth" of 1, see later) and adds two additional artificial states, the
start state S and the end state F, to form all states of the grammar. We will throughout the paper use the terms state and page interchangeably. From the processing of the sessions in the web log, each state is marked with the number of times it has been requested. The probability of a production is assigned based on the information in the web log, so that the probability of a production is proportional to the number of times the given link was traversed relative to the number of times the state on the left side of the production was visited. Note that not all links within a web site may have been traversed, so some of the links might not be represented in an HPG. The probability of a string in the language of the HPG can be found by multiplying the probabilities of the productions needed to generate the string. Note that web pages might be linked in a circular fashion and therefore the language of the HPG could be infinite. An HPG therefore specifies a threshold against which all strings are evaluated. If the probability of a string is below the threshold, the string will not be included in the language of the HPG (with the assigned threshold). This generates a complete language for a given HPG with a given threshold.

Mining an HPG is essentially the process of extracting high-probability strings from the grammar. These strings are called rules (the notions of a rule and a string will be used interchangeably). The rules describe the most preferred trails on the web site, since they are traversed with a high probability. Mining can be done using both a breadth-first and a depth-first search algorithm [4]. A parameter is used when mining an HPG to allow for mining of rules "inside" the grammar, that is, rules whose leftmost state has not necessarily been the first request in any session. This is done by assigning probabilities to the productions from the start state to all other states, depending on this parameter and on whether or not the state occurs first in a session. An example of an HPG is shown in Figure 2.2.
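The following is a small worked sketch of these definitions; the counts, the example trail, and the threshold symbol λ are ours for illustration and are not taken from Figure 2.2.

```latex
% Production probability: traversals of the link relative to visits of its source state.
% With hypothetical counts n(A_1) = 40 and n(A_1 \to A_2) = 30:
p(A_1 \to A_2) = \frac{n(A_1 \to A_2)}{n(A_1)} = \frac{30}{40} = 0.75

% Probability of a string (trail) = product of its production probabilities;
% the string belongs to the language only if it reaches the threshold \lambda:
p(S \to A_1 \to A_2 \to F) = p(S \to A_1)\, p(A_1 \to A_2)\, p(A_2 \to F) \;\ge\; \lambda
```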
Figure 2.2: Example of a Hypertext Probabilistic Grammar.

Note that, as mentioned above, not all links on the web pages are necessarily represented in the grammar. One of the links from Figure 2.1 has not been traversed and is therefore not represented in the grammar. In Figure 2.2 a web page maps to a state in the grammar. The HPG can also be generated with a history depth above 1. With a history depth of, e.g., 2, each state represents two web pages requested in sequence. The structure of the HPG remains the same, but each state now has a "memory" of the last states traversed. With a history depth of 2, a state might then be named A1A2, for the traversal A1 to A2.

The mining of rules on the HPG using a simple breadth-first search algorithm has been shown to be too imprecise for extracting a manageable number of rules. Heuristics have been proposed to allow for a more fine-grained mining of rules from an HPG [7, 6]. The heuristics are aimed at specifying controls that more accurately and intuitively present relevant rules mined from an HPG and allow for, e.g., generation of longer rules and for returning only a subset of the complete ruleset.

Representing Additional Information: An HPG has no memory of detailed click information in the states, so if rules using additional information, e.g., sessions for users with specific demographic parameters, were to be mined, each production would have to be split into a number of middle-states, where each middle-state would represent some specific combination of the parameters. Each middle-state should then be labeled with the respective combination of information. This is illustrated in Figure 2.3, where A and B are original states and the states 1 to 7 represent new middle-states. Note that the probability of going from a middle-state to B is 1 for each middle-state.
Figure 2.3: Including demographic information.

Note also that Figure 2.3 only illustrates additional information grouped into seven different categories, which, e.g., could be some demographic information about the user. If several different kinds of information were to be represented, the number of middle-states would increase exponentially, since all combinations of the different kinds of parameters would potentially have to be represented for all states in the HPG, thus creating a problem with state explosion. For instance, if the HPG should represent the gender and marital status of users, each production could potentially be split into four middle-states. However, if the HPG should also represent whether a user had children or not, each production would need to be split into eight middle-states. This factor increases with the cardinality of each additional parameter. This solution scales very poorly. For instance, representing gender, age in years, salary (grouped into ten categories), number of children (five categories), job status (ten categories) and years of working experience could easily require each production to be split into over 4 million middle-states, as can be seen by multiplying the cardinalities of all the parameters. Doing this for an HPG that includes only ten interconnected states would require over 400 million states (including middle-states) in the full HPG. This is clearly not a scalable solution, since the number of clicks represented might not even be 400 million. Furthermore, the existing algorithms [4] would have to be expanded to be able to work on this new type of states in the HPG. Alternatively, a single middle-state could be inserted in each production, containing a structure indicating the distribution of clicks over the parameters. This would effectively reduce the state explosion but require significant changes to the existing mining algorithms to include parameters in the mining process and to support, e.g., mining of rules for a range of parameter values.

Pros: The size of an HPG is not proportional to the number of sessions it represents but to the number of states and productions in the grammar, which makes the model very compact, scalable and self-contained. Additional sessions can easily be added to an existing HPG, thereby allowing an HPG to grow in the number of sessions represented without growing in size. Performing data mining on an HPG outputs a number of rules describing the most preferred trails on the web site using probability, which is a relatively intuitive measure of the usage of a web site. Mining the HPG for a small, high-probability set of rules requires the use of the described heuristics, which provide the user of the HPG with several parameters in order to tune the ruleset to his or her liking and extract rules for a very specific area.

Cons: The HPG does not preserve the ordering of clicks found in the sessions added (using a history depth above 1 will preserve some ordering). Therefore the rules mined could potentially be false trails, in the sense that no or few sessions include the trail, but a large number of sessions include the different parts of the trail. In order to extract rules for a specific subset of all sessions in the HPG, a specialized HPG for exactly these sessions needs to be created. This is because a "complete" HPG does not have a memory of individual clicks and their session, as described above.
A collection of sessions can potentially contain a very large number of different subsets, so building and storing specialized HPGs for every subset is not a scalable option, as mentioned above. Therefore, if more specialized HPGs are needed, there would be a need for storing the web log information so that the desired sessions could be extracted later on and a specialized HPG could be built and mined.
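To make the state-explosion argument above concrete, here is a back-of-the-envelope calculation; the individual cardinalities are assumptions, chosen only to be consistent with the figures quoted above.

```latex
% Assumed cardinalities: gender 2, age 100, salary 10, children 5, job status 10, experience 40.
2 \times 100 \times 10 \times 5 \times 10 \times 40 = 4\,000\,000 \text{ middle-states per production}

% Ten fully interconnected states have on the order of 10 \times 10 productions, so
100 \times 4\,000\,000 = 4 \times 10^{8} \text{ states in the full HPG.}
```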
2.3 Summary

As mentioned in the preceding sections, each approach has some inherent strengths and weaknesses. The click and subsession fact approaches handle additional information easily, but result in either huge I/O or huge storage requirements. On the other hand, an HPG can efficiently mine long rules for a large collection of sessions, but is not able to represent additional information in a scalable way. To remedy this situation, we now present a new approach for extracting information about the use of a web site that exploits the HPG approach's potential for mining very specific rules while still allowing for the representation of additional information using a click fact schema in a data warehouse, thus building on existing data warehouse technology. The main idea is to create HPGs on demand, where each dynamically created HPG represents a specialized part of the information in the web log.
3 The Hybrid Approach

3.1 Overview

Our proposal for a hybrid approach combines the click fact schema with the HPG model, creating a flexible technique able to answer both general and detailed queries regarding web usage. Using the detailed information from the click fact schema results in almost no data being lost in the conversion from web log to database. However, we also need some kind of abstraction or simplification applied to the query results. In short, we want to be able to specify exactly what subset of the information we want to discover knowledge for. An HPG provides us with a simple technique to create an overview from detailed information. The scalability issues and the somewhat limited flexibility of the HPG model (see Section 2.2) must, however, also be kept in mind, as we want to be flexible with regard to querying possibilities. The concept of the hybrid approach is shown in Figure 3.1.
Figure 3.1: Overview of the Hybrid Approach.

As mentioned in Section 2.2, creating specialized HPGs would require storing the original web log information and creating each specialized HPG when required. Storing the original web log file is not very efficient, since, e.g., a lot of non-optimized string processing would be required, so some other format needs to be devised. The click fact schema described above provides this detailed level, since it preserves the ordering of the web log and furthermore offers database functionality such as backup, optimized querying, and the possibility of applying OLAP techniques such as pre-aggregation. However, a more subtle feature of the click fact schema lends itself directly to the HPG model and proves useful in our solution. The database schema for the click fact table includes unique keys for both the referrer and the destination of each click. These two keys uniquely identify a specific production within a grammar, since each key is a reference to a page and thus a state. Thereby we are able to extract all productions from the click fact table simply by returning all combinations of url_key and referer_key. Each occurrence of a specific combination of keys represents a single traversal of the corresponding link on the web site. Retrieving all states of an HPG is likewise immediately possible from the click fact schema. The url_dimension table holds information about each individual page on the web site, so a single query can easily retrieve all states in the grammar and a count of how many times each state was visited, both in total and as first or last in a session. These queries can be used to initialize an HPG. This would normally be done using an algorithm iterating over all states in the sessions [4], but using the database representation, the required information can be retrieved in a few simple database queries. Note that some post-processing of the query results is necessary for a nice in-memory representation.
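The two extraction queries can be sketched as follows. The grouping formulation and the column name url_name are assumptions for illustration; the exact queries generated by the prototype may differ.

```sql
-- Productions: every (referrer, requested page) combination with its traversal count.
SELECT c.referer_key, c.url_key, COUNT(*) AS traversals
FROM click_fact c
GROUP BY c.referer_key, c.url_key;

-- States: every page with the total number of times it was requested
-- (counts of being first or last in a session are obtained analogously).
SELECT u.url_key, u.url_name, COUNT(*) AS visits
FROM url_dimension u
JOIN click_fact c ON c.url_key = u.url_key
GROUP BY u.url_key, u.url_name;
```

From these two result sets the production probabilities are obtained by dividing each traversal count by the visit count of the production's left-hand state.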
3.2 Constructing Specialized HPGs

Creating specialized HPGs is indeed possible with this approach. Inserting a constraint layer between the database software and the creation process for an HPG allows for restrictions on the information represented in the HPG. Extending the queries described above to only extract information for clicks with certain characteristics allows for the creation of an HPG representing only this information. Thereby the rules mined on this HPG will be solely for clicks with the specified characteristics. Using this concept, specialized HPGs can be created on demand from the database. For instance, the consultant might be interested in learning about the characteristics of walkthroughs for male, senior salesmen in a company. The consultant will specify this constraint and the queries above are modified to only extract the productions and the states that apply to the constraint. The queries will therefore only extract the sessions generated by male, senior salesmen in the company, and the HPG built from these sessions will produce rules describing the characteristic behavior of male senior salesmen (a sketch of such a constrained query is shown at the end of this section). This approach combines the techniques described earlier so as to utilize their strong sides and avoid some of their weak sides.

Pros: The limitation of the "simple" HPG framework (see Section 2.2), namely that it cannot efficiently represent additional information, is avoided with this hybrid approach. The ability to easily generate a specialized HPG overcomes the shortcoming of not being able to store all possible specialized HPGs. Saving the web log information in the click fact table (and thus in the database) gives us a tool for storing information which is arguably preferable to storing the original log file. A DBMS has many techniques for restoring, querying and analyzing the information, with considerable performance gains over processing raw textual data such as a log file. Combining the click fact schema, which offers a detailed level of information, and the HPG framework, which offers a more generalized and compact view of the data, allows for different views on the same data within the same model, without storing information redundantly on non-volatile storage.

Cons: As the hybrid approach mines results using the HPG, false trails might be presented to the user, a characteristic inherited from the general HPG approach. This is obviously a critical issue, since it might lead to misinterpretations of the data. Using a history depth greater than 1 might reduce the number of false trails.

Open Issues: The performance of generating specialized HPGs using queries on top of a database is an open issue that will be explored next. The number of rules mined from an HPG should not be too large and the rules should not be too short, so some of the heuristics mentioned in Section 2.2 need to be implemented when mining the HPG to be able to present the information to the user in a manageable way.
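As an illustration of the constraint layer, the query below restricts the production extraction to sessions of male, senior salesmen. The user_dimension table, its columns, and the link from session_dimension to it via user_key are hypothetical, since the concrete demographic schema is not fixed in this paper.

```sql
-- Productions computed only from sessions of male, senior salesmen
-- (user_dimension, user_key, gender, job_title and seniority are assumed names).
SELECT c.referer_key, c.url_key, COUNT(*) AS traversals
FROM click_fact c
JOIN session_dimension s ON s.session_key = c.session_key
JOIN user_dimension u    ON u.user_key    = s.user_key
WHERE u.gender = 'M'
  AND u.job_title = 'salesman'
  AND u.seniority = 'senior'
GROUP BY c.referer_key, c.url_key;
```

The state extraction query is constrained in the same way, so the resulting HPG represents exactly the selected sessions.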
4 Prototype Description

4.1 System Overview

The architecture of the hybrid approach prototype is shown in Figure 4.1. All modules of the system are implemented in the Java programming language. The prototype works as follows. First, the web log file from the web server is converted into an XML-based format. Then a Quilt query [9] is executed on the XML file, resulting in a new XML file. An XML parser is then invoked, which parses the XML into the click fact schema contained within the database. This part of the system is called the Data Warehouse Loader. We then use a simple Graphical User Interface (GUI) to control how the SQL queries extracting data from the database are constructed. The SQL generator then constructs four SQL queries which are used to query the database. We call this part the Constraint Layer. The results of these queries are used to construct an HPG structure held in main memory. A BFS mining algorithm then extracts rules from the HPG. This final part is simply named HPG.
4.2 Loading the Data Warehouse

The system is designed to read information from a web log. The typical format of a web log is the Common Log Format (CLF) [23] or the Extended Common Log Format (ECLF) [23]. Such web logs contain the URL requested, the IP address (or a resolved name, if possible) of the computer making the request, and timestamps.
Figure 4.1: Overview of the prototype.

An entry in a CLF log file for a single request is seen below.

ask.cs.auc.dk - - [31/Oct/2001:09:48:16 +0100] "GET /education/dat1inf1 HTTP/1.0" 301 314
The identification of users is non-trivial, as discussed in Section A.4. The prototype does not implement any logic to handle user identification but simply assumes that an IP address maps directly to a user. Also, the prototype does not at present include any means of dealing with proxy caching of data. Instead of writing a data warehouse loader reading the CLF format directly, we have chosen to convert it into a more high-level format based on XML. Once in this format, we are able to import and process our log data in a wide variety of programs, which provides flexibility. The first step in loading the data into the data warehouse is cleansing the log data. Instead of cleansing directly on the log file, using a temporary table in the database, or using other proposed techniques [10], we use an implementation of the Quilt query language named Kweelt [19]. Quilt is the predecessor of the XQuery language proposed by the W3C and is an SQL-like language for querying an XML structure. We use two Quilt queries to produce an XML file cleansed of irrelevant information and grouped into sessions for each host. When the web log has been transformed into XML, we are ready to load it into the data warehouse. Using a SAX (Simple API for XML) parser, we have implemented separate transformers from XML to each of the data warehouse schemas described in Section 2. Note that the only modifications needed if we want to change the underlying DBMS are to ensure that the SQL types used by the transformer are supported on the chosen DBMS and that a driver for connecting the transformer to the DBMS exists. Provided with the XML file, the transformers parse our XML structure into the data warehouse schemas and the loading is completed. No scheme for handling the uncertainty associated with the decision-time is implemented in the prototype, but it could readily be extended to assume a predefined timespan covering, e.g., the streaming time for each request. Based on previous work [10], we have chosen to spawn a new session if the dwell time in a session exceeds 25 minutes. The spawning of a new session is handled entirely by the log transformers in the prototype, and not by the Kweelt queries.
4.3 Constraint Layer Implementation

The main idea of combining a data warehouse with the HPG technique is the ability to constrain the data on which the HPG is built. A detailed description of how this constraining is achieved is presented here. We need SQL queries to extract our constrained set of data and then pass it on to a mechanism initializing our HPG. However, the extracted
data must first be divided into session-specific and click-specific information. This distinction is very important, as the constructed HPG could otherwise be incorrect.

Session specific: Dimensions which are specific to an entire session will, when constrained on one or more of their attributes, always return entire sessions as the result. One such dimension is the session dimension. If the click fact schema is constrained to only return clicks referencing a subset of all sessions in the session dimension, it is assured that the clicks returned will form complete sessions. Also, if we assume that all sessions start and finish on the same date, the date dimension provides the same property as the session dimension. Clicks constrained on a specific date will then always belong to a connected sequence of clicks. In an HPG context this means that the constructed HPG never has any disconnected states - states with no productions going either to or from them.

Click specific: Dimensions containing information about a single click will, if the click fact table is constrained on a subset of their keys, produce a set of single clicks which probably do not form complete sessions. The probability of this increases as the cardinality of the attribute grows or the number of selected attributes shrinks. For instance, when constraining on 3 URLs from a set of 1000 in the URL dimension, the clicks returned will probably not constitute complete sessions and the probability of false trails increases dramatically. Furthermore, the HPG produced will then consist of 3 states with potentially no productions between them and some productions leading to states not included in this HPG. These 3 states are disconnected states. To be able to derive any rules from an HPG we need to have states with productions connecting them.

As we want the approach to provide the overview as well as the detail, these two types of dimensions must be dealt with before the HPG is built. The solution proposed here is to constrain the data in two stages. The two stages (see Figure 4.2) can briefly be described as follows. First, we retrieve a temporary result using the dimensions which are thought to be click specific (1a). The temporary result comes from joining the click specific dimensions with the click fact schema on all constrained attributes. The distinct session keys from the temporary result (1b) can then be used to get the subset of all clicks having these session keys, which is done in step 2. These distinct keys are session specific and will assure an interconnected HPG. Second, we constrain the result using the dimensions which are thought to be session specific together with the distinct session keys (2a). All constraints must be fulfilled on the dimensions when joining with the click fact schema. The collection of clicks retrieved is interconnected (2b). Note that both steps will in practice be performed as one query.
Figure 4.2: Extraction of data for the HPG in two stages (1a: identify clicks; 1b: retrieve session-specific keys; 2a: identify all clicks in those sessions; 2b: use all clicks to build the HPG).

When executing queries on the database, java.util.ResultSet objects are returned. The HPG's initialization methods are provided with the respective ResultSet objects, which are dealt with internally in each method. The method used to initialize the HPG state set runs through the ResultSet object provided and creates a State object for each row in the ResultSet object. Transparency between the specific SQL query and the result used for HPG initialization is maintained, as the ResultSet object is well-defined and not affected by a change in the SQL query. The HPG initialization methods are at this stage provided with data, and as such, the extraction of data from the data warehouse is completed.
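To make the two-stage constraining concrete, the query below sketches how a click-specific constraint (a particular page, here the products.html page from Figure 4.2) and a session-specific constraint (on total_session_seconds) might be combined into the single query mentioned above. The url_name column is an assumed name, and the join formulation avoids subqueries, which were not yet supported in the MySQL release used for the prototype; the prototype's generated SQL may differ.

```sql
-- Stage 1 (1a/1b): the join on url_dimension identifies the sessions that
-- contain at least one click on the constrained page.
-- Stage 2 (2a/2b): all clicks of those sessions are retrieved, further constrained
-- on the session-specific dimension, so the resulting HPG has no disconnected states.
SELECT DISTINCT c.*
FROM click_fact c
JOIN session_dimension s ON s.session_key = c.session_key
JOIN click_fact c2       ON c2.session_key = c.session_key
JOIN url_dimension u     ON u.url_key = c2.url_key
WHERE u.url_name LIKE '%products.html%'      -- click-specific constraint
  AND s.total_session_seconds < 3000;        -- session-specific constraint
```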
4.4 HPG Implementation

The HPG implementation consists of two parts: the HPG building method and a Breadth-First Search (BFS) mining algorithm. The BFS algorithm is chosen instead of the Depth-First Search (DFS) algorithm, as the latter has a higher memory consumption [4]. We have chosen to implement both parts using the algorithms and specifications presented by Borges [4]. As our work is focused on creating a hybrid between the standard click fact schema and the standard HPG model, we are not interested in altering any of these techniques. The only difference between the prototype implementation and the work by Borges [4] is that rules and productions are stored in main memory instead of a database. The HPG implemented in the prototype holds two simple lists in main memory, one for states and one for productions. These lists are initialized from the results from the constraint layer and the mining process works entirely on these lists.
5 Experimental Evaluation

In evaluating the hybrid approach we decided to evaluate the performance of on-the-fly HPG creation both by itself and against a straightforward SQL query on a subsession schema (see Section 2.1). Evaluating the hybrid model by itself allows us to identify potential bottlenecks in the on-the-fly creation of HPGs and target any optimization to the areas where performance suffers. Evaluating against the subsession schema allows us to more precisely foresee for what types of queries our approach could be a better choice than the subsession schema, and how the arguably greater effort of building and mining an HPG performs against a straightforward query on a database. We decided to evaluate against the subsession schema since Zenaria, as described, will mostly be interested in information about sequences of clicks. As walkthroughs are generally long sessions, the subsession schema will arguably perform better than the click fact schema, since each additional click in a sequence returned from a click fact schema requires a self-join operation in the click fact table query [1]. We decided not to evaluate against a complete HPG representing the specialized information, since we argued that this solution would scale very poorly and the mining algorithms would need some expansion (see Section 2.2). Also, techniques requiring a number of scans of the database for mining the patterns [2, 3, 17, 18] would perform too poorly to be used for on-line mining.
5.1 Experimental Settings

We now describe the various elements used as settings for our experimental evaluation. The goal of these experiments is to retrieve information concerning the performance of creating HPGs on-the-fly on top of a database. The experiments include both unconstrained mining and mining constrained on both click and session specific dimensions (see Section 4.3), and are performed on both a non-indexed and an indexed database. The conclusions of the experiments are summarized in Section 5.4.

We have used the web log from the Computer Science Department at Aalborg University for October 2001. The web site is primarily an information site for students and staff but also contains personal sites with a wide range of content. The web log was 300 MB in ECLF, but we used only CLF-compliant entries, only accepted requests processed with code 200 (OK), and only included requests from outside users. In total, the web log then contained 33000 sessions with a total of 115000 valid clicks amongst 27400 unique pages. The DBMS used in the evaluation is MySQL 4.0.1 running on an Intel 933 MHz machine. The prototype described in Section 4 is running on an AMD Athlon 1800 MHz machine through a JBoss 2.4.4 application server and is implemented using JSP files running on Tomcat 4.0.1-beta. The tests were executed from a JSP file, since this was the entry point initially chosen for the prototype.

The insertion and generation of subsessions can be quite resource-demanding [1], so in order to avoid filling up the database with very long subsessions (some of the sessions from the web log were as long as 1900 requests) we adopted the suggested convention [1] of limiting the subsession length to 10 clicks. This avoids a massive blowup in the number of subsessions generated and ensures that a constant number of extra subsessions is generated for each request beyond 10 clicks. Insertion of subsessions yielded a total of 620000 entries in the subsession fact table divided amongst 490000 different subsessions. We now turn to the actual experimental results.
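The claim of a constant number of extra subsessions per additional request follows directly from the subsession formula of Section 2.1; the derivation below is ours, under the stated cap of 10 clicks.

```latex
% Subsessions (length >= 2, capped at length 10) of a session with n >= 10 clicks:
\sum_{k=2}^{10} (n - k + 1) = 9n - 45
```

so each request beyond the tenth contributes exactly 9 additional subsessions.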
5.2 Analysis of the Hybrid Approach

The combination of the click fact schema with the HPG approach might pose problems or bottlenecks for the hybrid approach as a whole. To clarify whether certain parts of the approach perform better than others, and thereby identify possible areas in which optimization could have a significant effect, different experiments were performed. These experiments are not performed for comparison with other techniques, but merely as a detailed measurement of each part of the system. In the process of extracting rules from the data warehouse, three main tasks are performed. First, the database is queried for data used to initialize the HPG. Second, the HPG is built in main memory based on the extracted data, and third, the BFS mining algorithm extracts rules from the HPG. To test how the approach performs as the number of states increases, a range query test was performed: all clicks belonging to sessions in which the total time spent is below a certain threshold were extracted. The threshold was increased in steps of 100 seconds, from 0 seconds up to 20000 seconds, thereby increasing the number of states included in the HPG. The result of this experiment is shown to the left in Figure 5.1.
Figure 5.1: Total Processing Time and Database Query Time.

The figure shows that by far the most time-consuming part of the hybrid approach is the database query time. In most cases, more than 90% of the total time is spent in this part. On the other hand, the time used for the BFS mining is so short that it can hardly be distinguished in the figure. The fast running time is the result of an optimization of the data structures used for representing the HPG. The time spent on initializing the HPG adds only a little to the overall processing time, usually 5–10%. A detailed breakdown of the time used to initialize the HPG can be found in Appendix B. Thus, it seems that a possible tuning effort should focus on the database query part. The most promising way to optimize the query time would be to apply materialized views [16], which could precompute most of the aggregate results required to build the HPG, yielding a huge performance improvement and making the database query time almost independent of the number of clicks in the database. However, the MySQL database used in the prototype does not implement this feature, so experimenting with this technique is left for future work. The right side of Figure 5.1, which presents the database querying part of the total processing, shows that the query times are distributed with 60–70% for the StateSet extraction query, 20–30% for the ProductionSet extraction query and 5–10% for the Session extraction query. Here, we have made the optimization of reusing the ProductionSet result to compute the total number of requests for each state. However, optimizing the query part should not be targeted at a specific database query, but at the overall database performance. One of the most obvious ways to optimize the database is to create indexes on the key attributes in the click fact schema. Our experiments (presented in detail in Appendix B) show a performance gain of around 25% from doing this. As mentioned above, materialized views are another promising opportunity for tuning.
5.3 Comparative Analysis

We now compare the hybrid approach to the subsession schema. We compare the two approaches on extraction of information for all sessions, for session specific constraints, and for click specific constraints (see Section 4.3 for an explanation).

All sessions: This test compares the hybrid approach, extracting all clicks to the HPG and mining, against the subsession schema, which must query the database for the most frequent and longest subsessions. The results for indexed and non-indexed databases are shown to the left in Figure 5.2. It is seen that the non-optimized hybrid approach performs worse than the subsession schema by a factor of approximately 25. However, the hybrid approach where both the database querying and the BFS mining are optimized takes only 30% longer than the subsession approach. This is rather surprising, since the subsession approach only needs to query one table in the database, whereas the hybrid approach needs to extract information to memory and perform the mining process on the built HPG. The small performance advantage of the subsession approach comes at a large price: the storage requirements are more than 5 times higher than for the hybrid approach, due to the need to store subsessions of all lengths.
Figure 5.2: Hybrid Versus Subsession Approach For All Rules and Click Constraints.

Click specific constraints: This test is evaluated by extracting information for all sessions containing requests including the string tbp, which is a specific personal homepage on the web site. The results of using both approaches with this constraint (again, both non-indexed and indexed) are shown to the right in Figure 5.2. The hybrid approach proves to be a factor of 10 faster than the subsession schema approach even in the non-optimized case; the optimized version is 20 times faster. The obvious reason for the relatively slow database query in the subsession schema approach is the need to compare each entry in the url_sequence_dimension with tbp. This is necessary since the sequence of URL requests is stored as a single string, so all entries must be compared.
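The difference can be illustrated with the two constraint queries sketched below; subsession_fact, url_sequence_key, url_sequence and url_name are assumed names.

```sql
-- Subsession schema: the whole URL sequence is stored as one string, so every
-- dimension row must be scanned and string-compared against 'tbp'.
SELECT f.*
FROM subsession_fact f
JOIN url_sequence_dimension q ON q.url_sequence_key = f.url_sequence_key
WHERE q.url_sequence LIKE '%tbp%';

-- Hybrid approach: only the much smaller url_dimension is string-matched;
-- the click fact table is then joined on integer keys.
SELECT DISTINCT c.session_key
FROM click_fact c
JOIN url_dimension u ON u.url_key = c.url_key
WHERE u.url_name LIKE '%tbp%';
```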
Figure 5.3: Hybrid Versus Subsession Approach For Session Specific Constraints.
Session specific constraints: This test resembles the range query test performed in Section 5.2. The session specific constraint is evaluated using a constraint on the total_session_seconds field of the session_dimension table. Figure 5.3 shows the results of evaluating the hybrid approach and the subsession schema on this session constraint, using both non-indexed and indexed versions of the schemas. The figure shows that the optimized hybrid approach outperforms all the other approaches by an average factor of 3–4. All the other approaches are rather similar in performance. Again, the superior performance of the optimized hybrid approach is accompanied by a storage requirement that is more than 5 times smaller than for the subsession approach. Note furthermore that the hybrid approach provides some functionality which cannot be achieved with the subsession schema. Since the click fact schema stores each click, click specific constraints can be put on the time spent on individual clicks. This is not possible with the subsession schema approach, since the time spent on individual clicks is not stored in the schema.
5.4 Experiment Summary

To conclude, our evaluation of the hybrid approach has shown that the optimized hybrid approach is very competitive when compared to the subsession fact approach. Even with a storage requirement that is more than 5 times smaller than for the subsession fact approach, the optimized hybrid approach performs 3–20 times faster; only when mining all rules is the performance a little slower. This makes the hybrid approach the clear winner. As a consequence, the query response time is fast enough to allow on-line rule mining. Furthermore, the hybrid approach allows for click specific constraints on the time spent on individual pages, which cannot be handled by the subsession fact approach. Also, the subsession fact table approach will break down if the average session is very long, as in the Zenaria case.
6 Conclusion and Future Work

Motivated by the need for an efficient way of mining sequence rules from large amounts of web log data, we presented a hybrid approach to web usage mining that aims to overcome some of the drawbacks of previous solutions. The approach is based on a novel combination of existing approaches, more specifically the Hypertext Probabilistic Grammar (HPG) [4] and Click Fact Table [14] approaches. The new approach utilizes the quick access and the flexibility with respect to additional information of the click fact table, and the capability of the HPG framework to quickly mine rules from large amounts of data. Specialized information is extracted from the click fact schema and presented using the HPG framework. The approach allows for additional information, e.g., user demographics, to be included in the analysis without introducing performance problems. A prototype has been implemented and experiments were presented showing that the hybrid approach performs very well compared to existing approaches, especially when mining sessions containing clicks with certain characteristics, i.e., when constraints are introduced.

In future work, we will finish the development of a tool for Zenaria. First of all, a front end which can be used by the consultants needs to be developed, including methods for visualization of the extracted rules. The mining process should be modified to include the heuristics mentioned in Section 2.2, which will allow for better control of rule extraction. Solutions to the open issues mentioned in Appendix A need to be added to the Data Warehouse Loader module, including, e.g., user tracking using cookies. Optimization of the prototype, especially of the database querying, is also an area of future improvement. The hybrid approach inherits the risk of presenting false trails from the HPG, as mentioned in Section 3. Developing a method to work around this risk, potentially utilizing history depth or some other technique, is an interesting direction. Finally, an expansion of the HPG mining process where individual pages can be assigned a measure of importance in the mining process is desirable. Such an expansion would improve the capability of the consultants to tune the mining process to specific pages in the interactive stories.
References

[1] J. Andersen, A. Giversen, A. H. Jensen, R. S. Larsen, T. B. Pedersen, and J. Skyt. Analyzing clickstreams using subsessions. In Proceedings of the Second International Workshop on Data Warehousing and OLAP, pp. 25–32, 2000.
[2] R. Agrawal and R. Srikant. Mining sequential patterns. In Proceedings of the Eleventh International Conference on Data Engineering, pp. 6–10, 1995.
[3] R. Srikant and R. Agrawal. Mining sequential patterns: Generalizations and performance improvements. In Proceedings of the Fifth International Conference on Extending Database Technology, pp. 3–17, 1996.
[4] J. Borges. A Data Mining Model to Capture User Web Navigation Patterns. PhD thesis, Department of Computer Science, University College London, 2000.
[5] J. Borges and M. Levene. Data mining of user navigation patterns. In Proceedings of WEBKDD, pp. 92–111, 1999.
[6] J. Borges and M. Levene. Heuristics for mining high quality user web navigation patterns. Research Note RN/99/68, Department of Computer Science, University College London, Gower Street, London, UK, 1999.
[7] J. Borges and M. Levene. A fine grained heuristic to capture web navigation patterns. SIGKDD Explorations, 2(1):40–50, 2000.
[8] A. G. Büchner, S. S. Anand, M. D. Mulvenna, and J. G. Hughes. Discovering internet marketing intelligence through web log mining. SIGMOD Record, 27(4):54–61, 1998.
[9] D. D. Chamberlin, J. Robie, and D. Florescu. Quilt: An XML query language for heterogeneous data sources. In WebDB (Informal Proceedings), pp. 53–62, 2000.
[10] R. Cooley, B. Mobasher, and J. Srivastava. Data preparation for mining world wide web browsing patterns. Knowledge and Information Systems, 1(1):5–32, 1999.
[11] R. Cooley, J. Srivastava, and B. Mobasher. Web mining: Information and pattern discovery on the world wide web. In Proceedings of the Ninth IEEE International Conference on Tools with Artificial Intelligence (ICTAI'97), 1997.
[12] R. Cooley, P. Tan, and J. Srivastava. WebSIFT: The web site information filter system. In Proceedings of the 1999 KDD Workshop on Web Mining, 1999.
[13] Jiawei Han and Micheline Kamber. Data Mining - Concepts and Techniques. Morgan Kaufmann, 2000.
[14] Ralph Kimball and Richard Merz. The Data Webhouse Toolkit. Wiley, 2000.
[15] Mark Levene and George Loizou. A probabilistic approach to navigation in hypertext. Information Sciences, 114(1-4):165–186, 1999.
[16] Oracle Corporation. Oracle 8.1.5i Tuning with Materialized Views. http://www.cs.auc.dk/ndb/oradoc/doc/server.815/a67775/ch2.htm.
[17] J. Pei, J. Han, B. Mortazavi-Asl, and H. Zhu. Mining access patterns efficiently from web logs. In Proceedings of the Pacific-Asia Conference on Knowledge Discovery and Data Mining, pp. 396–407, 2000.
[18] J. Pei, J. Han, B. Mortazavi-Asl, H. Pinto, Q. Chen, U. Dayal, and M. Hsu. PrefixSpan: Mining sequential patterns by prefix-projected growth. In Proceedings of the Seventeenth International Conference on Data Engineering, pp. 215–224, 2001.
[19] Arnaud Sahuguet. Kweelt - querying XML in the new millennium, 2000. http://db.cis.upenn.edu/Kweelt.
[20] Sawmill. http://www.sawmill.net.
[21] Myra Spiliopoulou and Lukas C. Faulstich. WUM: A Web Utilization Miner. In Proceedings of the Workshop on the Web and Databases, pp. 109–115, 1998.
[22] WebTrends LogAnalyzer. http://www.webtrends.com/products/log/.
[23] World Wide Web Consortium. W3C httpd Common Log Format. http://www.w3.org/Daemon/User/Config/Logging.html.
[24] K.-L. Wu, P. S. Yu, and A. Ballman. SpeedTracer: A web usage mining and analysis tool. IBM Systems Journal, Internet Computing, Volume 37, 1998.
[25] Zenaria A/S. http://www.zenaria.com.
A The Zenaria Case
This paper is written in cooperation with the company Zenaria A/S [25]. This section provides some background information about the company and describes the area in which they need a new solution. It also includes some details about adapting the web usage mining concept to the Zenaria case.
A.1 Current Scenario

Zenaria A/S is in the business of creating interactive stories, told through a series of video-sequences. The story is formed by a user viewing a video-sequence and choosing between some predefined options based on the current video-sequence. Depending on the actual choice of the user, a new video-sequence is shown and new options are presented. The choices of a user will form a complete story - a walkthrough - and the outcome of this walkthrough will reflect the choices made by the individual user. An example of the structure of an interactive story is illustrated in Figure A.1. Stories vary in length, but a typical walkthrough of a story features around 15 video-sequences. A screenshot from a video-sequence is shown in Figure A.2.
Figure A.1: Example of the structure of a story.

Stories are typically designed to educate and evaluate employees in a company, and the evaluation is supported by a consultant supervising the walkthrough of each user. Several factors weigh into the evaluation of a single user, but the complete story as formed by the choices, and the time spent from when a video-sequence has been viewed until a choice is made, referred to as the decision-time (not to be confused with the dwell-time, see later), are important parameters in evaluating a user. The stories are distributed on individual CD-ROMs and do not at present include any way of storing the information generated by walkthroughs in a central database. The consultant must therefore supervise the walkthrough of most users to be able to evaluate the employees of a company as a whole.
A.2 Future Scenario

Zenaria A/S would like to change this way of distributing their interactive stories and use the Internet instead. The company does not wish to develop a specialized program or use any existing proprietary program to allow users to access the stories online. They have decided to allow users to access the stories using only a web browser. This will allow users to access the stories at all times and from any computer connected to the Internet. Distributing the stories this way does, however, make the task of the consultant all but impossible. He will no longer be able to supervise all users and will therefore not be able to evaluate the usage of the story as a whole, at least not without some extra information being presented to him. The main task is therefore to present as many parameters as possible to the consultant and thus enable the consultant to perform some of the same evaluation(s) as before. This means, amongst other things, that the walkthroughs of users must be traceable and the decision-time for each user on each video-sequence should also be available.

Figure A.2: Screenshot from a Zenaria story.
A.3 Using Web Usage Mining

Since access to the interactive stories is done entirely through the HTTP protocol, the web log will reflect all walkthroughs of a story as a number of requests forming user sessions. Presenting such information to the consultant and allowing for some kind of querying upon the information would allow the consultant to evaluate the usage of the story without actually being present when the users walk through the story (with the previous approach the consultant could visually observe the user; using the Internet, this perspective is lost). Designing a system capable of supporting the job of the consultant requires some knowledge of what types of questions the consultant could desire to have answered. Some of the important types are mentioned below.
• Questions concerning common choices of users, e.g., finding a typical path through the story or parts of the story that have been used frequently by the users.
• Questions concerning the decision-time of the users, e.g., what constitutes a typical or a minimal walkthrough measured in decision-time.
• Questions focusing on how specific groups of users have used the story, e.g., all men or all team leaders within a company.
• Questions combining elements of the above, which would further improve the usefulness of a web mining tool for the consultant.

Note specifically that the consultant is interested in how the stories are used, and not as much in the usage of individual video-sequences. This means that there is a definite need for knowledge discovery on sequences of clicks in the web log.

7 Note that using the previous approach, the consultant could visually see the user. Using the Internet, this perspective is lost for the consultant.
The task is therefore to create an environment in which the consultant can easily pose these and other questions, and have the results returned and illustrated in a manner that supports the consultant in evaluating the use of a story. Answering questions concerning specific groups of users will require the users to specify some information, e.g., demographic data, which can be used by the consultant in a later evaluation. This is not a part of the software currently used by Zenaria, so it must also be supplied by a new framework. Note also that the consultant cannot be presumed to be a "technical" person, and the tools must therefore present themselves in an easy and understandable manner to the consultant.
A.4 Potential Problems
Several specific problems arise when attempting to adopt web usage mining to the problem domain of Zenaria. Some of these will be mentioned briefly in the following. Note that some of the problems are inherent to web mining, whereas others are specific to the Zenaria case. The solutions chosen for these problems in our implemented prototype are described in Section 4.

Estimating Decision-time: As mentioned in Section A.2, tracking the decision-time is important, but transferring this to an Internet setting is rather hard. In any case it will only be possible to create an estimate of this time, since the web log of the server will only reflect the time between two requests. In between two requests, a lot of other things besides the decision-time will take place, namely generating the response, transferring the content to the client (where the streamed video in Zenaria's case is a potential bottleneck), viewing the video-sequence, and transferring the request back to the web server. Therefore, some logic handling the estimation of the actual decision-time needs to be implemented, or a schema where this uncertainty is handled must be adopted.

Tracking the Chosen Option: Several choices presented after viewing a video-sequence might lead to the exact same next video-sequence (and thus potentially the same web page), but have very different meanings, semantically, when evaluating the walkthrough. The consultant therefore needs to know not only which sequence of pages was viewed, but also which choices were taken leading from one web page to another. This calls for either sending the chosen option along with the request as a parameter (thus identifying the option from the request-string recorded in the web log) or solving it using some other kind of server-side logic. A sketch of the parameter-based approach is given below.

User Identification: Operating only on the web log will make the process of uniquely identifying users impossible. The only information present is the IP address, which only identifies a computer 8 and not a user. The problem multiplies when computers support several concurrent users making requests to the web server. A similar problem arises when a number of users are located behind a gateway or a firewall, since all requests made behind the gateway will appear in the web log with the IP address of the gateway. Several schemes for solving these problems have been suggested, including cookie-based identification and web server logic to identify users [14]. There are also issues concerning privacy on the Internet that will further hinder identification of users, but these are beyond the scope of this paper.

Tracking Requests: Some users access the Internet through a proxy server that stores information locally, away from the web server. Depending on how Zenaria chooses to implement the framework, the fact that requests will not be forwarded all the way to the web server will severely hinder the tracking of walkthroughs from the web log. Some solutions have been suggested [14], including explicitly setting no-cache information on the response to the client and calculating unseen requests.
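As an illustration of the parameter-based approach to the "Tracking the Chosen Option" problem, the following is a minimal sketch of a Java servlet that receives the chosen option as a request parameter and forwards the user to the next video-sequence page. The servlet name, the parameter names (story, scene, option), and the lookup of the next scene are hypothetical and only serve to show that the chosen option becomes part of the request string and thus ends up in the web log; Zenaria's actual framework may solve this differently.

```java
import java.io.IOException;
import javax.servlet.ServletException;
import javax.servlet.http.HttpServlet;
import javax.servlet.http.HttpServletRequest;
import javax.servlet.http.HttpServletResponse;

/**
 * Hypothetical servlet handling a request such as
 *   GET /choose?story=42&scene=7&option=3
 * Because the chosen option is a request parameter, it is recorded
 * verbatim in the web log and can later be extracted during log analysis.
 */
public class ChooseOptionServlet extends HttpServlet {

    protected void doGet(HttpServletRequest request, HttpServletResponse response)
            throws ServletException, IOException {
        String story = request.getParameter("story");
        String scene = request.getParameter("scene");
        String option = request.getParameter("option");

        // Discourage proxy caching so every choice reaches the web server
        // (cf. the "Tracking Requests" problem above).
        response.setHeader("Cache-Control", "no-cache");
        response.setHeader("Pragma", "no-cache");

        // Look up the next video-sequence for this choice; the lookup itself
        // is application logic and not shown here.
        String nextScene = lookupNextScene(story, scene, option);
        response.sendRedirect("/play?story=" + story + "&scene=" + nextScene);
    }

    private String lookupNextScene(String story, String scene, String option) {
        // Placeholder: a real implementation would consult the story definition.
        return scene + "-" + option;
    }
}
```

Since each such request also carries a timestamp in the web log, the same mechanism provides the raw data needed for estimating the decision-time discussed above.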
B Additional Experiments
This section contains more details about the experimental evaluation of the hybrid approach. First, the running time of the BFS-Mine algorithm is presented to the left in Figure B.3. We see that the time used by BFS-Mine is negligible (less than 50 ms) in almost all cases. Only in one case does the time go up to 350 ms, probably due to the garbage collection performed by the Java runtime system. As the other parts of the hybrid approach take several seconds, the time used for BFS-Mine makes no real difference.
8 An IP-address can be re-assigned using, e.g., DHCP, so in principle it does not even identify an individual computer.
Figure B.3: BFS Mining Performance and Non-indexed vs. Indexed Click Fact

Next, experiments with indexing the click fact schema are presented. As the keys from each dimension are widely used when joins with the fact table are performed, we have chosen to create indexes on the primary keys of the dimension tables. The performance gained when comparing a non-indexed version of the click fact schema with this indexed version can be seen to the right in Figure B.3. A performance gain of around 25% is obtained by creating these indexes. This indexing strategy might not even be optimal; creating an index on all referenced keys in the fact table could further decrease the query time.
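As a rough illustration of this indexing experiment, the sketch below creates the indexes through JDBC. The connection URL, the index names, and the use of plain CREATE INDEX statements are assumptions made for illustration; the table and column names follow the Click Fact schema in Appendix C. Note that many database systems index primary key columns implicitly, in which case only the fact-table indexes created in the second loop would add anything.

```java
import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.SQLException;
import java.sql.Statement;

/** Hypothetical helper that adds the indexes discussed above. */
public class CreateClickFactIndexes {

    public static void main(String[] args) throws SQLException {
        // Connection details are placeholders.
        try (Connection con = DriverManager.getConnection("jdbc:postgresql://localhost/weblog");
             Statement stmt = con.createStatement()) {

            // Indexes on the primary keys of the dimension tables,
            // as in the indexed version measured in Figure B.3.
            String[][] dimensions = {
                {"url_dimension", "url_key"},
                {"date_dimension", "date_key"},
                {"timeofday_dimension", "timeofday_key"},
                {"session_dimension", "session_key"},
                {"timespan_dimension", "timespan_key"},
            };
            for (String[] d : dimensions) {
                stmt.executeUpdate(
                    "CREATE INDEX idx_" + d[0] + " ON " + d[0] + " (" + d[1] + ")");
            }

            // Possible further step mentioned above: also index the referenced
            // (foreign key) columns in the fact table itself.
            String[] factKeys = {"url_key", "referrer_key", "date_key",
                                 "timeofday_key", "session_key", "timespan_key"};
            for (String k : factKeys) {
                stmt.executeUpdate(
                    "CREATE INDEX idx_click_fact_" + k + " ON click_fact (" + k + ")");
            }
        }
    }
}
```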
Figure B.4: HPG Initialization Time

A detailed version of the time used to initialize the HPG can be found in Figure B.4. The HPG initialization process is divided into three parts. First, the set of states for the HPG is initialized. Then, the set of productions is initialized, and finally the start production probabilities are calculated. Figure B.4 shows that this last step uses very little time; otherwise, the initialization time is evenly divided between state set and production set initialization. Again, we believe the spikes in the figure to be caused by the Java garbage collection. However, the spikes only make a difference of about 200 ms and are thus of no importance for the overall performance.
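For readers unfamiliar with the three phases, the following is a minimal sketch of how an HPG could be initialized from a set of sessions. The class and field names are hypothetical, production probabilities are estimated simply as relative transition counts, and start probabilities are taken from first-click counts only; the prototype follows the HPG formulation referenced in the paper and may weight these quantities differently.

```java
import java.util.HashMap;
import java.util.List;
import java.util.Map;

/**
 * Hypothetical sketch of the three HPG initialisation phases:
 * (1) initialise the state set, (2) initialise the production set,
 * (3) compute start production probabilities.
 */
public class HpgInitSketch {

    // Phase 1: state set, here simply page identifier -> click count.
    private final Map<String, Integer> stateCount = new HashMap<>();
    // Phase 2: production set, keyed "from>to" -> transition count
    // (assumes page identifiers do not contain '>').
    private final Map<String, Integer> transitionCount = new HashMap<>();
    // Helper for phase 3: page identifier -> number of sessions starting there.
    private final Map<String, Integer> firstClickCount = new HashMap<>();

    private final Map<String, Double> productionProbability = new HashMap<>();
    private final Map<String, Double> startProbability = new HashMap<>();
    private int sessionCount = 0;

    /** Feed one session (an ordered list of page identifiers) into phases 1 and 2. */
    public void addSession(List<String> session) {
        if (session.isEmpty()) return;
        sessionCount++;
        firstClickCount.merge(session.get(0), 1, Integer::sum);
        for (int i = 0; i < session.size(); i++) {
            stateCount.merge(session.get(i), 1, Integer::sum);
            if (i + 1 < session.size()) {
                transitionCount.merge(session.get(i) + ">" + session.get(i + 1), 1, Integer::sum);
            }
        }
    }

    /** Phase 2 (probabilities): transition count divided by the source state's count. */
    public void initProductions() {
        for (Map.Entry<String, Integer> e : transitionCount.entrySet()) {
            String from = e.getKey().substring(0, e.getKey().indexOf('>'));
            productionProbability.put(e.getKey(), e.getValue() / (double) stateCount.get(from));
        }
    }

    /** Phase 3: start production probabilities, here simply from first-click counts. */
    public void initStartProductions() {
        if (sessionCount == 0) return;
        firstClickCount.forEach(
            (page, n) -> startProbability.put(page, n / (double) sessionCount));
    }
}
```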
C Data Warehouse Schemas
This section contains examples of DW schemas for the Click Fact, Session Fact, and Subsession Fact approaches, shown in Figures C.5, C.6, and C.7, respectively.
Click Fact (fact table): Url_key, Referrer_key, Date_key, TimeOfDay_key, Session_key, Timespan_key, Number_in_session, Is_first, Is_last, Click_seconds
URL Dimension: Url_key, Url_name, Part_of_site
Date Dimension: Date_key, Day_of_month, Month, Quarter, Year, Day_of_week, Day_of_year, Workday, Holiday
TimeOfDay Dimension: TimeOfDay_key, Hour, Minute, Second, Secs_since_midnight, Time_span
Session Dimension: Session_key, Ip, Login, Start_page, End_page, Session_clicks, Total_session_seconds
Timespan Dimension: Timespan_key, Seconds

Figure C.5: The Click Fact Schema
Session Fact (fact table): Date_key, TimeOfDay_key, User_key, Start_page, End_page
Date Dimension: Date_key, Day_of_month, Month, Quarter, Year, Day_of_week, Day_of_year, Workday, Holiday
TimeOfDay Dimension: TimeOfDay_key, Hour, Minute, Second, Secs_since_midnight, Time_span
User Dimension: User_key, Ip, Login

Figure C.6: The Session Fact Schema
Subsession Fact (fact table): Url_sequence_key, Session_key, TimeOfDay_key, Date_key, Timespan_key
Url_Sequence Dimension: Url_sequence_key, Url_sequence, Is_last, Is_first, Length, Number_of, Last_url, First_url
Session Dimension: Session_key, Ip, Login, Start_page, End_page, Session_clicks, Total_session_seconds
TimeOfDay Dimension: TimeOfDay_key, Hour, Minute, Second, Secs_since_midnight, Time_span
Date Dimension: Date_key, Day_of_month, Month, Quarter, Year, Day_of_week, Day_of_year, Workday, Holiday
Timespan Dimension: Timespan_key, Seconds

Figure C.7: The Subsession Fact Schema
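To make the star-schema structure concrete, the following is a minimal sketch of how part of the Click Fact schema of Figure C.5 could be created through JDBC. The column types, the connection URL, the assumption that Referrer_key references the URL dimension, and the decision to show only two of the dimension tables are our own illustrative choices; the remaining dimensions follow the same pattern.

```java
import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.SQLException;
import java.sql.Statement;

/** Hypothetical sketch creating part of the Click Fact schema of Figure C.5. */
public class CreateClickFactSchema {

    public static void main(String[] args) throws SQLException {
        try (Connection con = DriverManager.getConnection("jdbc:postgresql://localhost/weblog");
             Statement stmt = con.createStatement()) {

            // Two example dimension tables; the Date, TimeOfDay, and Session
            // dimensions are created analogously.
            stmt.executeUpdate(
                "CREATE TABLE url_dimension ("
                + " url_key INTEGER PRIMARY KEY,"
                + " url_name VARCHAR(500),"
                + " part_of_site VARCHAR(100))");
            stmt.executeUpdate(
                "CREATE TABLE timespan_dimension ("
                + " timespan_key INTEGER PRIMARY KEY,"
                + " seconds INTEGER)");

            // The fact table references the dimensions through foreign keys;
            // constraints are only shown for the dimensions created above.
            stmt.executeUpdate(
                "CREATE TABLE click_fact ("
                + " url_key INTEGER REFERENCES url_dimension,"
                + " referrer_key INTEGER REFERENCES url_dimension,"
                + " date_key INTEGER,"
                + " timeofday_key INTEGER,"
                + " session_key INTEGER,"
                + " timespan_key INTEGER REFERENCES timespan_dimension,"
                + " number_in_session INTEGER,"
                + " is_first BOOLEAN,"
                + " is_last BOOLEAN,"
                + " click_seconds INTEGER)");
        }
    }
}
```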