
Evaluation of Web Usage Mining Approaches for User's Next Request Prediction

Mathias Géry, Hatem Haddad
Information Technology Department
VTT Technical Research Centre of Finland, Espoo, Finland

{Mathias.Gery, Ext-Hatem.Haddad}@vtt.fi

ABSTRACT

Analysis of Web server logs is one of the important challenges in providing intelligent Web services. In this paper, we describe a framework for a recommender system that predicts a user's next requests based on behaviour discovered from Web log data. We compare results from three usage mining approaches: association rules, sequential rules and generalised sequential rules. We use two rule selection criteria: highest confidence and last subsequence. Experiments are performed on three collections of real usage data: one from an Intranet Web site and two from an Internet Web site.

Categories and Subject Descriptors
H.2.8 [Database Applications]: Data mining

General Terms
Algorithms, Experimentation

Keywords
Web Usage Mining, Association Rules, Frequent Sequences, Frequent Generalised Sequences, Evaluation

1. INTRODUCTION

In the WWW context, recommender systems are becoming widely used by users and information retrieval systems to improve results through both prefetching and recommendation [11]. In the literature, most researchers focus on Web usage mining, which analyses Web logs with a knowledge-discovery-in-databases process. Indeed, Web sites generate large amounts of Web log data that contain useful information about user behaviour. The term "Web Usage Mining" was introduced by Cooley et al. in 1997 [10], when a first taxonomy of Web mining was attempted; in particular, they define Web mining as the "discovery and analysis of useful information from the World Wide Web". It is also defined as "the application of data mining techniques to large Web data repositories" [9]. Citing the definition that Cooley et al. gave in [10], Web usage mining is the "automatic discovery of user access patterns from Web servers".

We define Web Usage Mining as the application of established data mining techniques to analyse Web site usage. By Web usage, we refer to the behaviour of one or more users on one or more Web sites; the central character is the user, and in order to analyse and study his behaviour, some data is needed. Much work has been done in the field of Web Usage Mining (WUM), trying to improve Web search by analysing user actions. The idea is to investigate the actions that the user performs on the search results and/or during the navigation that follows. If, given the 10 best matches for a query, a user selects the item with rank 5, this tells us a lot of valuable information about that item, and also something about the 4 preceding potential answers. Likewise, if the user has visited a given set of pages after his query, this tells us a lot about the interest of these pages with respect to the user's query. Independently of any query, it is also very interesting to investigate the user's actions while navigating a Web site.

In this paper, we describe a framework for a recommender system that predicts a user's next requests based on behaviour discovered from Web log data. The paper is organised as follows: section 2 introduces the problem. Section 3 presents some basic considerations and definitions about the user and his navigation. Then, we describe the three Web Usage Mining approaches we experimented with (association rules, frequent sequence rules and frequent generalised sequence rules) in section 4 and the resulting prediction in section 5, while section 6 describes our results using three collections of real Web usage data.

2. MOTIVATION AND RELATED WORK

Today, millions of visitors interact daily with Web sites around the world, and massive amounts of data about these interactions are generated. We believe that this information can be very precious for understanding user behaviour.

Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. To copy otherwise, to republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. WIDM’03, November 7–8, 2003, New Orleans, Louisiana, USA. Copyright 2003 ACM 1-58113-725-7/03/0011 ...$5.00.

Web Usage Mining is achieved first by reporting visitor traffic information based on Web server log files. For example, if various users repeatedly access the same series of pages, a corresponding series of log entries will appear in the Web log file, and this series can be considered as a Web access pattern.

We never know when a user leaves a site: users do not warn Web servers that their visit has ended. A visit is a collection of user clicks on a single Web server during a user session. The only criterion proposed to identify a server session is to detect, by examining the successive entries in the access log, that the client has not accessed the site for a reasonably long time interval (e.g. 25.5 minutes [6]).
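The timeout criterion above can be sketched as follows. The data layout ((ip, timestamp, url) triples) and the helper name sessionize are our own illustration; only the 25.5-minute threshold comes from the text.

```python
from datetime import datetime, timedelta

# Maximum gap between two clicks of the same visit (cf. [6]).
SESSION_TIMEOUT = timedelta(minutes=25.5)

def sessionize(entries):
    """Split a time-ordered list of (ip, timestamp, url) log entries
    into server sessions, starting a new session whenever the same
    client has been idle longer than SESSION_TIMEOUT."""
    open_sessions = {}          # ip -> currently open session
    finished = []
    for ip, ts, url in entries:
        current = open_sessions.get(ip)
        if current and ts - current["last"] > SESSION_TIMEOUT:
            finished.append(current)        # timeout: close the session
            current = None
        if current is None:
            current = {"ip": ip, "urls": [], "last": ts}
            open_sessions[ip] = current
        current["urls"].append(url)
        current["last"] = ts
    finished.extend(open_sessions.values())  # close whatever is still open
    return finished

log = [
    ("64.68.82.52", datetime(2003, 1, 26, 15, 5, 37), "/membres/mathias.gery/"),
    ("64.68.82.52", datetime(2003, 1, 26, 15, 7, 2), "/membres/"),
    ("64.68.82.52", datetime(2003, 1, 26, 16, 0, 0), "/"),  # > 25.5 min later
]
print([s["urls"] for s in sessionize(log)])
# → [['/membres/mathias.gery/', '/membres/'], ['/']]
```

A real pre-processing phase would first parse the raw log and sort entries by client and timestamp before applying this split.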

In recent years, there has been an increasing number of research works in Web Usage Mining (cf. [7, 8, 18, 3, 4, 13] and their developments). The main motivation of these studies is to get a better understanding of the reactions and motivations behind users' navigation. Some studies also apply the mining results to improve the design of Web sites [16], to analyse system performance and network communications, or even to build adaptive Web sites [20]. We can distinguish three Web mining approaches that exploit Web logs: association rules (AR) [11, 5], frequent sequences [17] and frequent generalised sequences [15, 12]. Algorithms for the three approaches have been developed, but few experiments have been done with real Web log data. In this paper, we compare the results of the three approaches on the same Web log data.

3. CONSIDERATIONS AND DEFINITIONS

With the aim of studying Web usage mining, we present in this section our definitions of the user and his navigation.

3.1 User and navigation

A user is identified as a real person or an automated software agent, such as a Web crawler (i.e. a spider), accessing files from different servers over the WWW. The simplest way to identify users is to consider that one IP address corresponds to one distinct user. A click-stream is defined as a time-ordered list of page views. A user's click-stream over the entire Web is called the user session, whereas the server session is defined as the subset of clicks over a particular server by a user, also known as a visit. Catledge has studied user page view time over the WWW and recommended 25.5 minutes as the maximal session length [6]. An episode is defined as a set of sequentially or semantically related clicks; this relation depends on the goals of the study.

3.2 Web log files

The easiest way to find information about users' navigation is to explore the Web server logs. The server access log records all requests processed by the server. A server log L is a list of log entries, each containing a timestamp, a host identifier, the URL requested (including URL stem and query), the referer, the user agent, etc. Every log entry conforming to the Common Log Format (CLF) contains some of these fields: client IP address or hostname, access time, HTTP request method used, path of the accessed resource on the Web server (identifying the URL), protocol used (HTTP/1.0, HTTP/1.1), status code, number of bytes transmitted, referer, user agent, etc. The referer field gives the URL from which the user has navigated to the requested page. The user agent is the software used to access pages; it can be a spider (e.g. GoogleBot, openbot, scooter) or a browser (Mozilla, Internet Explorer, Opera).

Another problem concerns user identification. It is probably more appropriate to refer to the client instead of the user: if the server does not use authentication techniques, it is impossible to know exactly who really requests a resource on the Web. User identification is important because the user is the character who creates a transaction; therefore, in the pre-processing phase, user identification is applied before transaction identification.

4. WEB USAGE MINING

4.1 Association rules (AR)

The first approach we study is based on association rules. The problem of association rule mining is well defined in the literature [1, 2].

Association rule mining was first proposed to find all rules in basket data (also called transaction data) in order to analyse how items purchased by customers in a shop are related (one data record per customer transaction). Association rule generation is achieved from a set F of frequent itemsets in an extraction context D, for a minimal support minsup. An association rule r is a relation between itemsets of the form r : X ⇒ (Y − X), in which X and Y are frequent itemsets and X ⊂ Y. The itemsets X and (Y − X) are called, respectively, the antecedent and the consequence of the rule r. The valid association rules are those whose support and confidence are greater than or equal to the minimal thresholds of support and confidence, called minsup and minconf. Support and confidence are calculated as follows:

Support(X) = |{t ∈ D | X ⊆ t}| / |D|    (1)

Confidence(r) = support(X ∪ (Y − X)) / support(X) = support(Y) / support(X)    (2)

The problem of finding Web pages visited together is similar to finding associations among itemsets in transaction databases. Once transactions have been identified, each of them can represent a basket, and each resource an item.
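Equations (1) and (2) can be computed directly over a set of transactions. A minimal sketch, in which the transaction set D and the page names are illustrative, not taken from the paper:

```python
def support(itemset, transactions):
    """Support of an itemset X: fraction of transactions containing X (eq. 1)."""
    itemset = set(itemset)
    return sum(itemset <= set(t) for t in transactions) / len(transactions)

def confidence(antecedent, consequence, transactions):
    """Confidence of the rule X => (Y - X): support(Y) / support(X) (eq. 2)."""
    return (support(set(antecedent) | set(consequence), transactions)
            / support(antecedent, transactions))

# Each transaction is the set of pages requested in one session.
D = [
    {"/index", "/papers", "/contact"},
    {"/index", "/papers"},
    {"/index", "/contact"},
    {"/papers", "/contact"},
]
print(support({"/index", "/papers"}, D))       # → 0.5
print(confidence({"/index"}, {"/papers"}, D))  # → 2/3 ≈ 0.67
```

An Apriori-style miner would enumerate the frequent itemsets first and then keep only the rules passing the minsup and minconf thresholds.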

We define a set of sessions S = {S1, ..., Sn}, where n is the number of sessions in S, and a set of URLs R = {url1, ..., urlu}, where u is the number of URLs in S. Each session Si is defined by an IP address and a set of URLs:

Si = (IPi, URLsi)

URLsi = {urli,1, ..., urli,k, ..., urli,m}  where urli,k ∈ R

IPi identifies the user of the session and URLsi is the set of pages requested by that user. We define m as the number of pages requested by user IPi in session Si. It depends on the click gap ∆ used to define a session (i.e. ∆ is the maximum session length). In our experiments, we have used ∆ = 25.5 minutes (as in [19] and, recently, in [11]).

Here is an example from the MRIM Web server log:

64.68.82.52 - - [26/Jan/2003:15:05:37 +0100] "GET /membres/mathias.gery/ HTTP/1.0" 200 12845 "-" "Googlebot/2.1 (+http://www.googlebot.com/bot.html)"

In this example, somebody using the machine with IP address 64.68.82.52 has requested the page /membres/mathias.gery/ on January 26th at 3:05 pm. In fact, the "user" is a spider from Google (Googlebot).
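A log entry of this kind can be parsed with a regular expression. A sketch assuming the combined format of the example above (CLF fields plus referer and user agent); the pattern itself is our own, not from the paper:

```python
import re

# Combined Log Format: CLF fields plus quoted referer and user agent.
LOG_PATTERN = re.compile(
    r'(?P<host>\S+) \S+ \S+ \[(?P<time>[^\]]+)\] '
    r'"(?P<method>\S+) (?P<path>\S+) (?P<protocol>[^"]+)" '
    r'(?P<status>\d{3}) (?P<bytes>\d+|-) '
    r'"(?P<referer>[^"]*)" "(?P<agent>[^"]*)"'
)

entry = ('64.68.82.52 - - [26/Jan/2003:15:05:37 +0100] '
         '"GET /membres/mathias.gery/ HTTP/1.0" 200 12845 "-" '
         '"Googlebot/2.1 (+http://www.googlebot.com/bot.html)"')

m = LOG_PATTERN.match(entry)
print(m.group("host"))   # → 64.68.82.52
print(m.group("path"))   # → /membres/mathias.gery/
print(m.group("agent"))  # → Googlebot/2.1 (+http://www.googlebot.com/bot.html)
```

The user-agent field extracted this way is what allows spiders such as Googlebot to be recognised and filtered out during pre-processing.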

Many algorithms can be used to mine association rules from the available data; one of the most widely used is the Apriori algorithm, proposed and detailed in [1]. For our recommender system, we need association rules with a single item in the consequence of the rule. A disadvantage of association mining is that it does not inherently use the notion of temporal distance, which we believe is crucial for deciding which rules to apply to a given Web transaction: association mining captures patterns relating to itemsets irrespective of the order in which they occur in a given transaction.

4.2 Frequent Sequences (FS)

Frequent sequence mining can be thought of as association mining over temporal datasets, where a sequence is a time-ordered list of non-empty itemsets. The aim of this technique is to discover time-ordered sequences of URLs that have been followed by past users, in order to predict future ones. We define ti,k as the time associated with urli,k:

URLsi = {(urli,1, ti,1), ..., (urli,k, ti,k), ..., (urli,m, ti,m)}
where ti,k < ti,k+1 and ti,m − ti,1 ≤ ∆

4.3 Frequent Generalised Sequences (FGS)

In order to extract frequent generalised subsequences, we have used the general algorithm proposed by Gaul. This is an Apriori algorithm adapted to generalised sequences, based on the fact that the support of any subsequence of a closed generalised sequence is greater than or equal to the support of the sequence itself. Thus, we can build every generalised sequence of length n, for n ≥ 3, as a junction of two smaller overlapping generalised sequences. The modified join step of the Apriori algorithm is described in [12].

5. RECOMMENDATION

We define two criteria to select discovered rules matching the pages requested by a user: Highest Confidence (HC) and Last Sequence (LS).

5.1 Rule selection function

The first rule selection criterion is highest confidence: since the antecedents of several discovered rules can match the pages requested by a user, the rules with the highest confidence are chosen to predict the next page. The second criterion (LS) considers the distance between the pages requested by a user and the consequence of a rule, where this distance is the number of clicks from one page to another. The LS strategy selects the rules whose matched pages are closest to the consequence; if several rules are equal under LS, the one with the highest confidence is chosen.
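One possible reading of the two criteria as code. The rule representation, the distance function and the example rules are our own illustration, not taken from the paper:

```python
# Each rule: (antecedent pages, predicted page, confidence).
rules = [
    (("/a",), "/b", 0.8),
    (("/b",), "/c", 0.9),
    (("/a", "/b"), "/d", 0.7),
]

def matching(rules, visited):
    """Rules whose antecedent is contained in the pages already requested."""
    return [r for r in rules if set(r[0]) <= set(visited)]

def select_hc(rules, visited):
    """Highest Confidence (HC): pick the matching rule with the best confidence."""
    return max(matching(rules, visited), key=lambda r: r[2])

def select_ls(rules, visited):
    """Last Sequence (LS): prefer rules whose antecedent occurs closest to the
    last requested page (fewest clicks back); break ties by confidence."""
    def distance(rule):
        # clicks between the most recent antecedent page and the last click
        return min(len(visited) - 1 - visited.index(p) for p in rule[0])
    return min(matching(rules, visited), key=lambda r: (distance(r), -r[2]))

visit = ["/a", "/b"]
print(select_hc(rules, visit))  # → (('/b',), '/c', 0.9)
print(select_ls(rules, visit))  # → (('/b',), '/c', 0.9)
```

With this visit, both criteria happen to agree; they diverge when a high-confidence rule matches only pages requested early in the session.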