Int. J. Web Engineering and Technology, Vol. 3, No. 3, 2007

Website link prediction using a Markov chain model based on multiple time periods

Shantha Jayalal*
Department of Industrial Management, University of Kelaniya, Dalugama, Sri Lanka
E-mail: [email protected]
*Corresponding author

Chris Hawksley and Pearl Brereton
School of Computing and Mathematics, University of Keele, Keele, Staffordshire ST5 5BG, UK
E-mail: [email protected]
E-mail: [email protected]

Abstract: The growing size and complexity of many websites have made navigation through these sites increasingly difficult. Attempting to automatically predict the next page a website user will visit has many potential benefits, for example in site navigation, automatic tour generation, adaptive web applications, recommendation systems, web server optimisation, web search and web pre-fetching. This paper describes an approach to link prediction using a Markov chain model based on an exponentially smoothed transition probability matrix which incorporates site usage statistics collected over multiple time periods. The improved performance of this approach compared to earlier methods is also discussed.

Keywords: website next link prediction; Markov chain; transition probability matrix; time weighting; single exponentially smoothed transition probability matrix; link history.

Reference to this paper should be made as follows: Jayalal, S., Hawksley, C. and Brereton, P. (2007) 'Website link prediction using a Markov chain model based on multiple time periods', Int. J. Web Engineering and Technology, Vol. 3, No. 3, pp.271–287.

Biographical notes: Shantha Jayalal holds a BSc in Industrial Management from the University of Kelaniya, Sri Lanka, a PG Diploma in Computer Science from the University of Colombo, Sri Lanka and a PhD in Computer Science from the School of Computing and Mathematics at Keele University in the UK. He is now a Lecturer in the Industrial Management Department at the University of Kelaniya, Sri Lanka. His research interests are web navigation, link prediction and semantic relatedness of web pages.

Copyright © 2007 Inderscience Enterprises Ltd.


Chris Hawksley holds a BSc in Communication Engineering from UMIST, UK, an MSc by research in computational linguistics, also from UMIST, and a PhD in Software Engineering from Keele. Now a Senior Lecturer in the School of Computing and Mathematics at Keele University in the UK, he teaches computer programming, networks and communications, and has authored several academic textbooks. His current research interests lie in pervasive computing and in web navigation.

Pearl Brereton is Professor of Software Engineering in the School of Computing and Mathematics at Keele University in the UK. Within the Software Engineering field, her main research interests are in service-based and component-based systems and in the application of techniques and processes from other disciplines.

1 Introduction

Websites have become the integration hubs for a wide variety of activities such as electronic commerce, communication, entertainment, education and government services, as well as being used for the dissemination of information (Nanopoulos et al., 2003). With increasing numbers of web users, there is a need to improve the website navigation experience (Bonino and Corno, 2003), and a range of web applications have emerged recently for this purpose. Many researchers have stressed the importance of link prediction in applications and activities such as website navigation, automatic tour generation, adaptive web applications, recommendation systems, web server optimisation, web search and web pre-fetching (Bonino and Corno, 2003; Gery and Haddad, 2003; Perkowitz and Etzioni, 2000; Sarukkai, 2000; Su et al., 2000; Zhu et al., 2002a; Zukerman et al., 1999).

One approach to link prediction is through the use of the Markov chain model. For example, a variation of the model using a probability transition matrix has been proposed by Sarukkai (2000) and later by Zhu et al. (2002a–b). However, one limitation of this approach is that the data used in link prediction is derived from a snapshot of user activity at a particular point in time. It can be argued that this leads to a bias in prediction centred around the particular interests of website users at that instant in time. In this paper, therefore, we extend the approach by using a Markov chain model based on an exponentially smoothed transition probability matrix which incorporates site usage statistics collected over multiple time periods. We claim that this approach offers improved prediction characteristics compared to the approaches of Sarukkai (2000) and Zhu et al. (2002a–b).

The structure of this paper is as follows: Section 2 describes related work and our approach to link prediction is presented in Section 3. In Section 4, we present an empirical evaluation of our approach. Section 5 concludes the paper and discusses the implications for future research.

2 Related work

Website users usually browse websites by following hyperlinks from one page to another. Hyperlinks on a page often refer to pages stored on the same server. Typically, there is a pause after each page is loaded, while the user reads the displayed material. The web server can pre-fetch pages during this pause that are likely to be accessed soon, thereby avoiding retrieval latency if and when those files are actually requested. The retrieval latency has not actually been reduced; it has been overlapped with the time the user spends reading, thereby decreasing the perceived access time (Padmanabhan and Mogul, 1996).

Several link prediction approaches for pre-fetching applications are explained in the literature. These include dependency graphs (Padmanabhan and Mogul, 1996), probability transition matrices (Bestavros, 1995), Markov chain models (Zukerman et al., 1999), content analysis of the pages requested recently by the user (Davison, 2002), user path profile analysis (Schechter et al., 1998), sequential behaviour models (Frias-Martinez and Karamcheti, 2002) and popularity-based link prediction models (Chen and Zhang, 2003).

A number of link prediction approaches for website navigation applications are also reported in the literature. For example:

•  analysis of user behaviour discovered in web log data (Gery and Haddad, 2003; Mobasher et al., 1999; Yan et al., 1996)

•  comparison of a short description entered by the user with the descriptions of all hyperlinks on the current page and with previous user interests (see WebWatcher (Joachims et al., 1997))

•  using a collaborative filtering approach (see Ringo (Shardanand and Maes, 1995) and GroupLens (Resnick et al., 1994))

•  a hybrid system combining both collaborative and content-based filtering approaches for link prediction (see the Fab system (Balabanovic and Shoham, 1997))

•  personal agents (Good et al., 1999)

•  using path-profiles of site users (see Syskill and Webert (Pazzani et al., 1996) and WhatNext (Su et al., 2000))

•  using a behaviour-based interface agent (see Letizia (Lieberman, 1995))

•  prediction based on a variety of Markov chain models (Sarukkai, 2000; Zhu et al., 2002a–b).

The main disadvantage of all the above link prediction approaches is that they are, to some extent, conditioned by the time period over which the website usage data is collected. It is advantageous to consider the current trends and seasonal behaviours of website usage when predicting links for website users (Iyengar et al., 1999) because it is then possible to incorporate recent significant events which affect user browsing behaviour. For example, the addition of a new link when a company launches a new product may change behaviour. We shall argue below that such time factors can be easily incorporated into the link prediction approach presented by Sarukkai (2000) and later Zhu et al. (2002a–b).

3 Link prediction: our approach

In this section, our approach to link prediction using the Markov chain model is presented. Such a model is constructed to reflect the collective behaviour of a group of users in navigating a website. Probability transition matrices based on Markov theory have long been used for the prediction of future activities in the field of Operations Research (Hillier and Lieberman, 2001), and Markov chains have been widely used to model user navigation on websites. Web pages can be treated as states, and hyperlinks between web pages as one-step transitions between these states in a Markov chain model. Information about website usage can be used to determine the transition probabilities between these states. The Markov chain model can then be used to predict the web pages that a user is likely to visit given a sequence of web pages already visited by that user or other users. The information needed can be extracted from web server log files (Zhu et al., 2002a–b).

3.1 Constructing the Markov chain model

The behaviour of an individual user in navigating a website can be characterised by the collective navigation behaviour of a group of users. A model can be built from this collective navigation behaviour and then used to predict an individual user's navigation on a website.

A finite-state stochastic process1 in which the future probabilistic behaviour of the process depends only on the present state is called a Markov chain (Daellenbach et al., 1983). The finite number of states are numbered 1, …, N, and at any time the process occupies (or is completely described by) one of these states. The process is said to make a transition when a change occurs in the process. Markov chains have the special property that probabilities involving how the process will evolve in the future depend only on the present state2 of the process, and so are independent of events in the past. This Markovian property can be expressed in terms of conditional probabilities for a stochastic process {Xt} in state St:

P(Xt+1 = St+1 | Xt = St, Xt–1 = St–1, …, X1 = S1, X0 = S0) = P(Xt+1 = St+1 | Xt = St)

for t = 0, 1, 2, …, N.

In other words, the state at time t + 1 depends only on the state at time t and not on the values of any of the earlier random variables Xt–1, …, X0, because each Xt depends only on Xt–1 and has an effect only on Xt+1. Any stochastic process {Xt} (t = 0, 1, 2, …) is a Markov chain if it has the Markovian property (Hillier and Lieberman, 2001).

An additional assumption in the analysis of a Markov process is that the probability of a transition from any state i to any state j is the same for any time t. That is:

P{Xt+1 = j | Xt = i} = Pij

The property that a Markov process' transitional behaviour does not change over time is called the stationary property.


3.2 Transition probability matrix

The probability Pij is called the transition probability of a system changing from state S = i at some time t to state S = j at time t + 1. If no transition can occur from state i to state j, Pij = 0. On the other hand, if the system, when it is in state i, can move only to state j at the next transition, Pij = 1. With this convention, transition probabilities can be defined from each state i to each state j. For a system with m states, the Pij values can be arranged as an (m × m) matrix, called a transition probability matrix:

P = [ p11  p12  ...  p1m ]
    [ p21  p22  ...  p2m ]
    [  :    :    :    :  ]
    [ pm1  pm2  ...  pmm ]

Each row represents the one-step transition probabilities3 over all m states. From this, it follows that each row sum of P is equal to 1:

∑j=1…m pij = 1

For example, consider the sample link graph illustrating the probabilities of selecting web pages in a section of the Keele University Computer Science website, as shown in Figure 1. P12 is the probability of selecting the link to page 2 from page 1. Therefore Pij can be defined as:

Pij = (total number of outgoing requests from page i to page j) / (all outgoing requests from page i)

Figure 1  Link graph of a sample website. Nodes: 1 – Keele Computer Science Home Page; 2 – Staff; 3 – HP-Dr A; 4 – HP-Dr B; 5 – HP-Dr C; 6 – Software Engineering Research Group; 7 – Research; 8 – Publications; 9 – GIS Research Group. Edges are labelled with the transition probabilities P12, P17, P23, P25, P34, P38, P48, P54, P58, P61, P65, P76, P79, P86, P89 and P96.

These probabilities can be represented as a transition probability matrix (Q) as shown in Figure 2.

Figure 2  Transition probability matrix (Q) for the link graph in Figure 1 (blank entries are zero)

Page    1      2      3      4      5      6      7      8      9
1              P12                                P17
2                     P23           P25
3                            P34                         P38
4                                                        P48
5                            P54                         P58
6       P61                         P65
7                                          P76                  P79
8                                          P86                  P89
9                                          P96

And, for any given page i which contains at least one outgoing link:

∑j=1…9 Pij = 1

where Pij = 0 if there is no link from page i to page j; otherwise 0 ≤ Pij ≤ 1, for i = 1, …, 9.
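To make the construction concrete, the following minimal Python sketch (ours, not from the paper; the function name and sample click data are hypothetical) builds such a matrix from a list of observed page-to-page transitions, using the definition of Pij given above:

from collections import Counter, defaultdict

def transition_matrix(transitions, m):
    """Build an m x m transition probability matrix from observed
    (from_page, to_page) pairs, with pages numbered 1..m."""
    counts = defaultdict(Counter)
    for i, j in transitions:
        counts[i][j] += 1
    Q = [[0.0] * m for _ in range(m)]
    for i, row in counts.items():
        total = sum(row.values())        # all outgoing requests from page i
        for j, c in row.items():
            Q[i - 1][j - 1] = c / total  # Pij as defined above
    return Q

clicks = [(1, 2), (1, 2), (1, 7), (2, 3), (2, 5), (6, 1), (6, 5)]  # toy data
Q = transition_matrix(clicks, m=9)
print(Q[0][1])  # P12 = 2/3: two of the three requests leaving page 1 went to page 2

Each row of the resulting matrix that has any outgoing requests sums to 1, matching the constraint above.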

3.3 Time weighting

It is important to consider the current trends and seasonal behaviours of website usage when predicting links for website users (Iyengar et al., 1999). To achieve this, greater emphasis is placed on the most recent web usage data when forming Q in order to provide the best predictions. On the other hand, the overall set of values for Q requires data collected over a longer period in order to reflect the long-term usage of the site as a whole, even though a higher weighting is given to recent usage data. The combination of these two features is important in the link prediction computation. This issue was ignored in earlier methods used for website link prediction (Perkowitz and Etzioni, 2000; Sarukkai, 2000; Zhu et al., 2002a), which weight website user behaviour equally over a single time period.

In our approach, Q is formed using the Single Exponential Smoothing Average (SESA) method. Consider several transition matrices, Qt, Qt–1, Qt–2, …, Qt–n, formed in different time periods. The most recent transition matrix at time t, Qt, is made most influential in calculating the single exponentially smoothed moving average QSESA over these n periods, given by:

QSESA = µQt + µ(1 – µ)Qt–1 + µ(1 – µ)²Qt–2 + µ(1 – µ)³Qt–3 + … + µ(1 – µ)ⁿQt–n

where µ is a constant and 0 ≤ µ ≤ 1. Here, greater weight is given to more recent transition probabilities and n previous transition matrices are taken into account.
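The smoothing itself is simple to compute. The sketch below (illustrative only; the matrix list, the function name and the use of numpy are our assumptions) folds a sequence of per-period matrices, ordered oldest first, into QSESA:

import numpy as np

def sesa_matrix(period_matrices, mu):
    """Single exponentially smoothed average of transition matrices.
    period_matrices is ordered oldest first; the last entry (Qt) gets
    weight mu, the one before it mu*(1 - mu), and so on."""
    q_sesa = np.zeros_like(period_matrices[-1])
    for k, q in enumerate(reversed(period_matrices)):
        q_sesa += mu * (1 - mu) ** k * q
    return q_sesa

Note that, exactly as in the formula above, the weights µ, µ(1 – µ), µ(1 – µ)², … do not sum to precisely 1 for a finite number of periods; what matters is the relative weighting of recent versus older behaviour.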


3.4 Link history

A website user's decision about which link to follow from a web page also depends on which pages he or she has visited before. Therefore, the link history of the particular user is considered together with the current web page when predicting the user's next link request. Consider a user currently at page i0, whose visiting history is a sequence of n pages {i–n+1, i–n+2, i–n+3, …, i0}. The history is represented by vectors: for the current page, L0 = {lj}, where lj = 1 when j = i0 and lj = 0 otherwise; and for the previous pages, Lk = {lj} (k = –1, …, –n+1), where lj = 1 when j = ik and lj = 0 otherwise.

As an example, consider a particular user who has visited the sample website in Figure 1 with a link history of {1, 7, 9, 6}. This link history can be represented as vectors L0, L–1, L–2, L–3, where L0 represents the current page, that is, page 6. According to the above definition:

L0 = {0,0,0,0,0,1,0,0,0}
L–1 = {0,0,0,0,0,0,0,0,1}
L–2 = {0,0,0,0,0,0,1,0,0}
L–3 = {1,0,0,0,0,0,0,0,0}

These history vectors are used together with the single exponentially smoothed moving average transition matrix to predict the website user's next link request.
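The history vectors are straightforward one-hot encodings. A small sketch (ours; the function name is hypothetical) that produces them for the example above:

import numpy as np

def history_vectors(visited_pages, m=9):
    """One-hot history vectors [L0, L-1, L-2, ...] for a visiting
    sequence such as [1, 7, 9, 6], whose last entry is the current page."""
    vectors = []
    for page in reversed(visited_pages):  # current page first
        v = np.zeros(m)
        v[page - 1] = 1.0
        vectors.append(v)
    return vectors

L = history_vectors([1, 7, 9, 6])
print(L[0])  # L0 -> [0. 0. 0. 0. 0. 1. 0. 0. 0.], i.e., current page 6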

3.5 Link prediction

The standard Markov chain model can be used to predict the pages that a user is most likely to visit in the next step given the current page. Sarukkai (2000) and later Zhu et al. (2002a–b) proposed a variation which predicts the page(s) to be visited in the next step given a sequence of previous pages visited by a user. Using Markov theory, when L0 is multiplied by the probability transition matrix Q, the probabilities of moving to the other pages in the next step can be calculated.

Sarukkai (2000) proposed a variant of the Markov chain model to accommodate weighting of more than one step in a user sequence. In his method, from the current step L0, one application of Q is needed to predict the next step; from the step L–1, two applications are needed; and from the (m – 1)th step back, L–m+1, m applications are needed. Therefore, given the link history vectors (L0, L–1, …, L–m+1) of a user and a transition probability matrix Q, the vector N of probabilities of each page being visited in the next step is calculated as:

N = a1 × L0 × Q + a2 × L–1 × Q² + … + am × L–m+1 × Qᵐ

where a1, a2, …, am are the weights assigned to the history vectors. Normally, 1 > a1 > a2 > … > am > 0, so that the closer a history vector is to the present, the more influence it has on the prediction. A visiting history sequence is represented as m pages, and each followed link in the link history is represented as a vector with a probability of 1 at that state for that time (denoted by L0, L–1, …, L–m+1). Vector L0 represents the current page of the user.


Vector N represents the probabilities for selection of the next page from the current page; the page with the highest probability value is used as the next-page prediction. The proposed method uses QSESA instead of Q in order to pay more attention to recent user behaviour while giving less attention to older user behaviour within the website.
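A direct implementation of this computation is short. The following sketch (ours, not the authors' code) raises the transition matrix to successive powers while accumulating the weighted contributions; passing QSESA instead of Q gives the proposed variant:

import numpy as np

def predict_next(history, Q, weights):
    """history is [L0, L-1, ...]; L0 is multiplied by Q, L-1 by Q^2,
    and so on, each contribution scaled by its weight a_k."""
    n = np.zeros(Q.shape[0])
    q_power = np.eye(Q.shape[0])
    for a, L in zip(weights, history):
        q_power = q_power @ Q            # Q, Q^2, Q^3, ...
        n += a * (L @ q_power)
    return int(np.argmax(n)) + 1, n      # predicted page number, and N itself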

4 Empirical evaluation

The Keele University Department of Computer Science website was used for the experimental evaluation of this link prediction approach. At the time of the evaluation, the site consisted of 2167 pages and 12 347 interconnections between them, and there were about 8934 requests per day during the experimental evaluation period. Web usage data was obtained by analysing the log file that records all the requests for the site.

4.1 Determining website usage data

A number of different mechanisms for obtaining website usage data have been developed, such as web server log file analysis, guest books and feedback forms, user surveys and focus groups, and software agents. Web server log file analysis was chosen for determining usage data for this study, mainly because the log files were readily available. A web server access log records all requests processed by the web server. While log file analysis does suffer from various weaknesses, as discussed below, improvements can be made by the introduction of cookies (Bryan and Gary, 2000). However, for privacy reasons, many website users are reluctant to accept cookies (Mobasher et al., 1999; Pirolli and Pitkow, 1999).

Web server log files are large text files generated by web servers. They contain records of any activity that took place between a web server and browsers during a particular time period. The contents of a web server log file entry include a timestamp, host identifier, URL request, referrer, agent, etc. Every log entry conforming to the Common Log Format (CLF) contains the following fields: client IP address or host name, access time, HTTP request method used, path of the accessed resource on the web server (identifying the URL), protocol used (HTTP/1.0, HTTP/1.1), status code, number of bytes transmitted, referrer and user agent. The referrer field gives the URL from which the user navigated to the requested page. The user agent is the software used to access pages; it can be a spider (GoogleBot, OpenBot, Scooter, etc.) or a browser (Mozilla, Internet Explorer, Opera, etc.) (Gery and Haddad, 2003). Example contents of the Keele University web server log file which are relevant to the Department of Computer Science web pages are shown in Figure 3.

While log file analysis does provide some measure of site usage, it suffers from several flaws. One major problem is caching, which may be browser caching, local site caching, local regional caching or wider regional caching. If any one of these caches finds the page corresponding to a sought-for URL, it will serve the page directly and the web server will never know that a request was made. Several commercial web analysis products and some researchers employ complex heuristics in order to make educated guesses about information that is excluded from log files (Bryan and Gary, 2000; Chen et al., 1996; Zhu et al., 2002a–b). Some researchers have assumed that caching does not much affect their research into web log analysis (Perkowitz and Etzioni, 2000), partly because of the complexity of those heuristics and their poor performance. If desired, more accurate logs can be generated using cookies or visitor-tracking software such as WebTrends4 (Perkowitz and Etzioni, 2000). In this research, no account has been taken of caching.

Figure 3  Sample web server logs

198.22.3.6 – [12/Jan/2004:00:03:21 +0000] "GET /depts/cs/Noticeboard/Examtt.html HTTP/1.1" 200 – "http://www.keele.ac.uk/cgi-bin/htsearch.html" "Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.1; FunWebProducts; Preload_01_07)"

198.22.3.6 – [12/Jan/2004:00:03:26 +0000] "GET /depts/aa/exams/timetables.htm HTTP/1.1" 200 – "http://www.keele.ac.uk/depts/cs/Noticeboard/Examtt.html" "Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.1; FunWebProducts; Preload_01_07)"

198.22.3.6 – [12/Jan/2004:00:07:22 +0000] "GET /depts/aa/exams/timetables.htm HTTP/1.1" 200 – "http://www.keele.ac.uk/depts/cs/Noticeboard/Examtt.html" "Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.1; FunWebProducts; Preload_01_07)"

198.22.3.6 – [12/Jan/2004:00:07:22 +0000] "GET /depts/cs/Noticeboard/Examtt.html HTTP/1.1" 200 – "http://www.keele.ac.uk/cgi-bin/htsearch" "Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.1; FunWebProducts; Preload_01_07)"

The main source of information used in this research is the site’s web server log, which records the pages viewed by each visitor to the site.5 For this experiment the only information used was the IP address of the machine from which the request originated, the date and time of the request, the URL requested, status code and the referrer. It is also assumed that each originating machine corresponded to a single user.6
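A few lines of Python suffice to pull these fields out of entries like those in Figure 3. This sketch is illustrative only: the regular expression is ours and assumes the field layout shown in the sample above.

import re

# host, timestamp, requested URL, status code and referrer, as in Figure 3
LOG_PATTERN = re.compile(
    r'(?P<host>\S+) \S+ \[(?P<time>[^\]]+)\] '
    r'"GET (?P<url>\S+)[^"]*" (?P<status>\d{3}) \S+ "(?P<referrer>[^"]*)"'
)

def parse_entry(line):
    m = LOG_PATTERN.match(line)
    return m.groupdict() if m else None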

4.2 Experimental design

For this experiment two sets of web logs were used, obtained from the Keele University web server, relating to the web pages of the Computer Science Department. The first log set is from 1 December 2002 to 31 December 2002 (log-version-1.txt) and the second log set from 1 September 2003 to 31 December 2003 (log-version-2.txt). The log-version-1.txt set was used as a training data set and the log-version-2.txt set for experimental evaluation purposes. The training data set was used to obtain appropriate values for the constant µ and the weights a1, a2, …, am by a trial and error method.

The experimental process is illustrated in Figure 4 and explained in the sections which follow. Firstly, the two log data sets were pre-processed as explained in Section 4.2.1 and two types of transition matrices were generated: the conventional transition matrix and the exponentially smoothed transition matrix. Using the pre-processed web log data, user sessions were identified. User sessions with three to seven visited pages were then used for link prediction, because there were far fewer user sessions of larger sizes in the experimental data sets. The pages other than the last page in these sequences were used to form the link history vectors, and the last page was predicted using the two transition matrices from the two methods (Sarukkai's and ours). The predicted pages from the two methods were compared with the actual outcome of each user session, as recorded in the web log.

Figure 4  Experimental design for link prediction evaluation (the log data set is pre-processed into a cleaned log data set, which is used both for transition matrix generation – producing the conventional transition matrix and the SESA transition matrix – and for session identification; the user sessions, the website topology and the two matrices then feed the link prediction step, which outputs N-LPSarukkai and N-LPSESA)

There were 49 986 entries in the original log-version-1.txt data set and 1 089 954 entries in the original log-version-2.txt data set before cleaning. 2167 unique URLs were in use during the experiment period (numbered as 1 to 2167 and stored in links.txt).

4.2.1 Pre-processing web usage data on the website A critical step in effectively mining web usage data is the cleaning and transformation of web log data into meaningful information and the identification of a set of user sessions. Cleaning the server logs involves removing redundant references (e.g., images and sound files, multiple frames and dynamic page references, etc.) leaving only one entry per page view. For this experimental evaluation, only the referenced and referring pages with .html or .htm extensions and page addresses starting with http://www.keele.ac.uk/depts/cs/ were kept and all other entries were removed from the log data sets. Thereafter this cleaned log data set was processed together with the site topology file, which contains all the sequentially numbered URLs belonging to the http://www.keele.ac.uk/depts/cs/ domain. This process replaces all the URLs in the cleaned log data set by integer numbers, making further processing easier. After cleaning there remained 19 079 and 347 833 entries in the log-version-1.txt and log-version-2.txt files, respectively.7
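The cleaning and URL-numbering steps just described might look like the following sketch (ours; it builds on parse_entry from Section 4.1, and url_to_id stands for the numbered topology loaded from links.txt):

def clean_and_number(entries, url_to_id):
    """Keep only .html/.htm page views under /depts/cs/ and replace
    each URL by its integer number from the site topology file."""
    cleaned = []
    for e in entries:                    # dicts as returned by parse_entry()
        path = e["url"]
        if path.startswith("/depts/cs/") and path.endswith((".html", ".htm")):
            cleaned.append({**e, "page": url_to_id[path]})
    return cleaned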

4.2.2 Transition matrix

A conventional transition matrix (Q) was obtained from the cleaned log data set (log-version-2.txt), without considering the time effect, for the 13 weeks (91 days) from 1 September 2003 to 30 November 2003. The four weeks of data from 1 December 2003 to 31 December 2003 were used for testing link prediction. The size of the matrix is 2167 × 2167, representing all the unique URLs available within our experimentation period.

4.2.3 Single exponentially smoothed average transition matrix

The time effect was considered in order to form the single exponentially smoothed average transition matrix (QSESA). For each day in the 13 weeks (91 days) from 1 September 2003 to 30 November 2003, one conventional transition matrix was formed from the cleaned log data set (log-version-2.txt); these are identified as Q1, Q2, Q3, …, Q91. The four weeks of data from 1 December 2003 to 31 December 2003 were used for testing link prediction. One day was chosen as the time interval between matrices as a compromise between calculation efficiency and the desire to reflect change over time in a reasonable manner. The most recent transition matrix, at day 91, Q91, is made most influential while the transition matrix at day 1, Q1, is made least influential in calculating QSESA:

QSESA = µQ91 + µ(1 – µ)Q90 + µ(1 – µ)²Q89 + µ(1 – µ)³Q88 + … + µ(1 – µ)⁹⁰Q1

that is,

QSESA = ∑t=0…90 µ(1 – µ)ᵗ Q(91–t)

The constant µ was set by trial and error to 0.25, which gave the most accurate link predictions on the training data set. The size of QSESA is the same as that of Q, that is, 2167 × 2167.

4.2.4 Session identification

The period of activity that a user with a unique IP address spends on a website during a specified period of time is called a user session (Kate and Alan, 2002). If the user comes back to the site within that time period, it is considered to be the same user session, and any number of visits within that time period will count as only one session. If the user returns to the site after the allotted time period has expired, it is counted as a separate user session.

Client-side and proxy-level caching often create impediments to the identification of unique user sessions. For example, in a web server log, all requests from a proxy server have the same identifier, even though the requests potentially represent more than one user. Techniques such as the use of client-side cookies for user identification are not always a practical solution to this problem due to privacy concerns of the user (Mobasher et al., 1999; Pirolli and Pitkow, 1999). In this experiment, session identification is required only to test our research questions and hypothesis. It would be possible to use some of the techniques mentioned above in a real-world implementation of link prediction to identify user sessions more accurately. For our experimental evaluation, however, we assume that each originating machine corresponds to a single user. In practice, this assumption will not much affect our conclusions.

The cleaned log data set was sorted by originating machine identification (and hence user identification), and then entries with the same user identification were sorted by date and time of the log entry request. It was assumed that the user had started a new session if, for a particular user, the time gap between two requests was greater than 30 minutes (Cooley et al., 1999; Nielsen, 2000). Many commercial website visitor-tracking applications use 30 minutes as a default timeout, and Catledge and Pitkow (1995) have established a timeout of 25.5 minutes based on empirical data. Therefore, 30 minutes was selected as the timeout for identifying user sessions. Only user sessions which have at least three entries (session size = 3) and a continuous request flow (without missing pages due to the caching effect) were chosen for this experiment; a sessionisation sketch is given after Table 1. The identified user sessions with their session sizes8 from our four-week cleaned log-version-2.txt data set from 1 December 2003 to 31 December 2003 which satisfy these conditions are given in Table 1.

Table 1  Session size and the number of user sessions found

Session size    Number of sessions
3               765
4               413
5               276
6               204
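The sessionisation sketch referred to above (ours; it assumes one user's requests, already sorted by time, as (timestamp, page) pairs):

from datetime import timedelta

SESSION_TIMEOUT = timedelta(minutes=30)

def split_sessions(requests):
    """Group one user's time-sorted requests into sessions, starting a
    new session when the gap between consecutive requests exceeds the
    30-minute timeout; keep only sessions with at least three pages."""
    sessions, current, last_time = [], [], None
    for ts, page in requests:
        if last_time is not None and ts - last_time > SESSION_TIMEOUT:
            sessions.append(current)
            current = []
        current.append(page)
        last_time = ts
    if current:
        sessions.append(current)
    return [s for s in sessions if len(s) >= 3]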

4.2.5 Link prediction

In this experiment the link history size was set to a value between 2 and 5 inclusive (link history size = session size – 1). This is mainly because these values identify a significant number of user sessions (see Table 1) from our four-week log data set; if the session size is more than six there are far fewer user sessions. For example, as explained in Section 3.5, using five link history vectors L0, L–1, L–2, L–3 and L–4 with QSESA, the link prediction is performed as follows:

N-LPSESA = a1L0QSESA + a2L–1QSESA² + a3L–2QSESA³ + a4L–3QSESA⁴ + a5L–4QSESA⁵

In this situation, vector N-LPSESA gives the probabilities of the user selecting the next page (the sixth page) from the current page (the fifth page). N-LPSarukkai was also calculated, using Q instead of QSESA. The values for the weights a1, a2, a3, a4 and a5 were set to 0.9, 0.5, 0.3, 0.2 and 0.1, respectively, because they gave good results on the training data set for both approaches. The page with the maximum value in N-LPSESA was selected as the page predicted by the LPSESA approach, and the page with the maximum value in N-LPSarukkai was selected as the page predicted by the LPSarukkai approach, for a particular user session. This process was repeated for all user sessions and the results are analysed in the next section.
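Combined with the sketches from Sections 3.3–3.5, the evaluation step for one method might then read as follows (illustrative only; Q here is either the conventional matrix or QSESA):

weights = [0.9, 0.5, 0.3, 0.2, 0.1]

def evaluate(sessions, Q):
    """Predict the last page of each session from the preceding pages
    and count how often the prediction matches the recorded page."""
    correct = 0
    for session in sessions:             # output of split_sessions()
        history, actual = session[:-1], session[-1]
        L = history_vectors(history, m=2167)
        predicted, _ = predict_next(L, Q, weights[:len(L)])
        correct += (predicted == actual)
    return correct / len(sessions)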

4.2.6 Results analysis

It is clear from Table 2 and Figure 5 that the LPSESA approach gives more accurate link prediction than the LPSarukkai approach for all user sessions considered (link history sizes from 2 to 5). As one would expect, both approaches give more accurate predictions as the link history size increases. At link history size 2, the LPSESA method predicts links 8% more accurately than the LPSarukkai method, while at link history size 3 it predicts links 15% more accurately. At link history sizes 4 and 5, the LPSESA method predicts links 7% and 6% more accurately, respectively. Therefore, at all link history sizes our method gives more accurate link predictions, and the advantage of the LPSESA method over the LPSarukkai method is most marked at the lower link history sizes. The research findings of Su et al. (2000) on their prediction system for web requests using an N-gram sequence model suggest that the accuracy of link prediction increases by 20% or more when the link history size is greater than 3.

Table 2  Results of the experimental evaluation of link prediction using the LPSarukkai and LPSESA approaches

Link      Total        LPSarukkai                            LPSESA
history   number of    Correct           Incorrect           Correct           Incorrect
size      sessions     Number   (%)      Number   (%)        Number   (%)      Number   (%)
2         765          365      47.7     400      52.3       425      55.6     340      44.4
3         413          199      48.2     214      51.8       263      63.7     150      36.3
4         276          157      56.9     119      43.1       177      64.1     99       35.9
5         204          117      57.4     87       42.6       130      63.7     74       36.3

Figure 5  Link history size versus link prediction accuracy percentage (link prediction accuracy (%), on a scale of 35.0 to 75.0, plotted against link history sizes 2 to 5 for the LPSarukkai and LPSESA approaches)

Since the data is paired and the outcome is ordinal, the Wilcoxon Signed Ranks Test (Sprent and Smeeton, 2001) was applied to assess the statistical significance of the results. This gives p < 0.005, so the null hypothesis can be rejected. This is evidence for the claim that the LPSESA approach to link prediction, based on Markov theory using a single exponentially smoothed average transition matrix incorporating a time factor, is able to predict the links that website users will follow more accurately than the approach proposed by Sarukkai (2000).

5 Discussion, conclusions and future work

The results in the previous section show a significant improvement in link prediction when using the LPSESA approach. This supports the conclusion that link prediction based on Markov theory using a single exponentially smoothed average transition matrix, which incorporates site usage data over a period of time, predicts links more accurately than link prediction based only on Markov theory using a conventional transition matrix.

This conclusion is based on an experimental evaluation carried out using the web logs of the Computer Science Department, Keele University, and may be valid for websites having a similar structure and browsing pattern. No evidence has been found in the literature which would suggest that the conclusion may be different for different browsing patterns and structures of websites. However, without performing experimental evaluations using web logs from websites which have different structures and different browsing patterns, it is not possible to generalise the conclusion to all link prediction at this stage. Further evaluation of the link prediction approach could be carried out using different web log data sets over a long period of time. This would make it possible to form different exponentially smoothed probability transition matrices based on different time intervals and would provide more insight into the time effect on link prediction.

The main limitation of the LPSESA approach to link prediction is the matrix multiplication. The single exponentially smoothed transition matrix, QSESA, has 2167 × 2167 elements. For this evaluation it was necessary to take the fifth power of QSESA; if the history size rises above five, it is necessary to raise QSESA to even higher powers. Also, if the number of web pages in the website rises, so does the size of QSESA. To limit this effect, matrix compression approaches can be applied to the transition matrix. Zhu et al. (2002a) have used the algorithm of Spears (1998) for compression of their transition matrix, of course sacrificing some accuracy of link prediction. Compression has not been attempted in this study, although a sparse-matrix alternative is sketched below.

Even though the transition matrix generation process is offline, it requires a considerable amount of processing time, and this will increase as the number of pages available and the number of page accesses increase. In this experimental evaluation, computation could be completed within a reasonable time on relatively modest server hardware, but with larger websites and/or very high user access it may take considerably longer. More powerful web servers and more efficient pre-processing algorithms may be needed in such situations. We do not attempt to address this limitation in this study.

The link prediction approach can be extended in several ways. User behaviour over time is considered in this research, but differences in behaviour between different user groups are not. It should be possible to segment the web log data into groups of users having similar navigation behaviour and to incorporate this group behaviour into the link prediction mechanism. Even though it was assumed that the caching effect does not degrade the accuracy of link prediction, it would be better to reduce this effect before forming the transition matrix. This can be addressed by introducing cookies, visitor-tracking software such as WebTrends, or educated-guess approaches like the maximal forward reference approach (Chen et al., 1996).
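The sparse-matrix mitigation mentioned above, not explored in this study, exploits the fact that most page pairs have no connecting hyperlink, so Q (and, to a lesser extent, QSESA) is overwhelmingly zeros. A hedged sketch using scipy's sparse matrices:

from scipy.sparse import csr_matrix, identity

def predict_sparse(history, Q_dense, weights):
    """The same computation as predict_next, with Q stored sparsely.
    Powers of a sparse matrix gradually densify, but for small history
    sizes this remains much cheaper than dense 2167 x 2167 products."""
    Q = csr_matrix(Q_dense)
    q_power = identity(Q.shape[0], format="csr")
    n = None
    for a, L in zip(weights, history):
        q_power = q_power @ Q
        contrib = a * (csr_matrix(L) @ q_power)
        n = contrib if n is None else n + contrib
    return n.toarray().ravel().argmax() + 1

This is complementary to, rather than a substitute for, the compression approach of Spears (1998) discussed above.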


How many web pages in the user history are appropriate when calculating the link prediction is also an important research question. In our study, three pages or more were found to be appropriate. However, it would be interesting to study this further using different web logs, different session sizes and different website contents and structures. This history value is very important not only for link prediction but also for website designers when they are considering the choice of link structure for a site.

Acknowledgement We would like to thank those who have contributed many useful comments. In particular, we would like to express appreciation to the anonymous reviewers for their useful comments, which helped us to improve the final presentation. The work was supported by the University of Kelaniya, Sri Lanka, the Ministry of Science and Technology, Sri Lanka and the Asian Development Bank.

References

Balabanovic, M. and Shoham, Y. (1997) 'Fab: content-based, collaborative recommendation', Communications of the ACM, Vol. 40, No. 3, pp.66–72.

Bestavros, A. (1995) 'Using speculation to reduce server load and service time on the WWW', Proceedings of the 4th ACM International Conference on Information and Knowledge Management (CIKM), Baltimore, MD, USA, pp.403–410.

Bonino, D. and Corno, F. (2003) 'An evolutionary approach to web request prediction', Proceedings of the 12th International World Wide Web Conference, Budapest, Hungary, May.

Bryan, W. and Gary, M. (2000) 'Using access information in the dynamic visualization of web sites', Proceedings of the 3rd South African Telecommunications, Networks and Applications Conference (SATNAC'00), Stellenbosch, South Africa.

Catledge, L.D. and Pitkow, J.E. (1995) 'Characterizing browsing strategies in the WWW', Computer Networks and ISDN Systems, Vol. 27, pp.1065–1073.

Chen, M.S., Park, J.S. and Yu, P.S. (1996) 'Data mining for path traversal patterns in a web environment', Proceedings of the 16th International Conference on Distributed Computing Systems, Hong Kong.

Chen, X. and Zhang, X. (2003) 'A popularity-based prediction model for web prefetching', IEEE Computer, Vol. 36, No. 3, pp.63–70.

Cooley, R., Mobasher, B. and Srivastava, J. (1999) 'Data preparation for mining WWW browsing patterns', Knowledge and Information Systems, Vol. 1, No. 1, pp.5–32.

Daellenbach, H.G., George, J.A. and McNickle, D.C. (1983) Introduction to Operations Research Techniques, Allyn and Bacon, Inc., ISBN 0-205-07718-8.

Davison, B.D. (2002) 'Predicting web actions from HTML content', Proceedings of the 13th ACM Conference on Hypertext and Hypermedia (Hypertext'02), College Park, MD, USA.

Frias-Martinez, E. and Karamcheti, V. (2002) 'A prediction model for user access sequences', Proceedings of the ACM International Conference on Knowledge Discovery and Data Mining, Edmonton, Alberta, Canada, July.

Gery, M. and Haddad, H. (2003) 'Evaluation of web usage mining approaches for user's next request prediction', Proceedings of the 5th ACM International Workshop on Web Information and Data Management, New Orleans, LA, USA.

Good, N., Schafer, B., Konstan, J.A., Borchers, A.I., Sarwar, B., Herlocker, J. and Riedl, J. (1999) 'Combining collaborative filtering with personal agents for better recommendations', Proceedings of the American Association for Artificial Intelligence (AAAI-99), Orlando, FL.

Hillier, F.S. and Lieberman, G.J. (2001) Introduction to Operations Research, McGraw-Hill Companies, Inc., ISBN 0-07-232169-5.

Iyengar, A.K., Squillante, M.S. and Zhang, L. (1999) 'Analysis and characterization of large-scale web server access patterns and performance', World Wide Web, Vol. 2, Nos. 1–2, pp.85–100.

Joachims, T., Freitag, D. and Mitchell, T. (1997) 'WebWatcher: a tour guide for the World Wide Web', Proceedings of the International Joint Conference on Artificial Intelligence (IJCAI-97), Nagoya, Japan, August.

Kate, A.S. and Alan, N. (2002) 'Web page clustering using a self-organizing map of user navigation patterns', Decision Support Systems, Vol. 995.

Lieberman, H. (1995) 'Letizia: an agent that assists web browsing', Proceedings of the International Joint Conference on Artificial Intelligence (IJCAI-95), Montreal, Quebec, Canada, pp.925–929.

Mobasher, B., Cooley, R. and Srivastava, J. (1999) 'Creating adaptive web sites through usage-based clustering of URLs', Proceedings of the 3rd IEEE International Workshop on Knowledge and Data Engineering Exchange (KDEX'99), Chicago, November, pp.19–25.

Nanopoulos, A., Katsaros, D. and Manolopoulos, Y. (2003) 'A data mining algorithm for generalized web prefetching', IEEE Transactions on Knowledge and Data Engineering, Vol. 15, No. 5.

Nielsen, J. (2000) Designing Web Usability, New Riders Publishing, Indianapolis, USA, ISBN 1-56205-810-X.

Padmanabhan, V.N. and Mogul, J.C. (1996) 'Using predictive prefetching to improve World Wide Web latency', Computer Communication Review, Vol. 26, No. 3, pp.22–36.

Pazzani, M., Muramatsu, J. and Billsus, D. (1996) 'Syskill & Webert: identifying interesting web sites', Proceedings of the American Association for Artificial Intelligence (AAAI-96), Portland, OR.

Perkowitz, M. and Etzioni, O. (2000) 'Towards adaptive web sites: conceptual framework and case study', Artificial Intelligence, Vol. 118, pp.245–275.

Pirolli, P. and Pitkow, J.E. (1999) 'Distributions of surfers' paths through the World Wide Web: empirical characterizations', World Wide Web, Vol. 2, Nos. 1–2, pp.29–45.

Resnick, P., Iacovou, N., Suchak, M., Bergstrom, P. and Riedl, J. (1994) 'GroupLens: an open architecture for collaborative filtering of Netnews', Proceedings of the ACM 1994 Conference on Computer Supported Cooperative Work, Chapel Hill, NC, USA, pp.175–186.

Sarukkai, R.R. (2000) 'Link prediction and path analysis using Markov chains', Computer Networks, Vol. 33, pp.377–386.

Schechter, S., Krishnan, M. and Smith, M.D. (1998) 'Using path profiles to predict HTTP requests', Proceedings of the 7th International WWW Conference, Brisbane, Australia.

Shardanand, U. and Maes, P. (1995) 'Social information filtering: algorithms for automating "Word of Mouth"', Proceedings of the ACM Conference on Human Factors in Computing Systems (CHI 95), New York, USA, pp.210–217.

Spears, W.M. (1998) 'A compression algorithm for probability transition matrices', SIAM Journal on Matrix Analysis and Applications, Vol. 20, No. 1, pp.60–77.

Sprent, P. and Smeeton, N.C. (2001) Applied Nonparametric Statistical Methods, Chapman & Hall/CRC, ISBN 1-58488-145-3.

Su, Z., Yang, Q., Lu, Y. and Zhang, H. (2000) 'WhatNext: a prediction system for web requests using N-gram sequence models', Proceedings of the International Conference on Web Information Systems Engineering (WISE 2000), Hong Kong, June.

Yan, T.W., Jacobsen, M., Garcia-Molina, H. and Dayal, U. (1996) 'From user access patterns to dynamic hypertext linking', Computer Networks and ISDN Systems, Vol. 28, pp.1007–1014.

Zhu, J., Hong, J. and Hughes, J.G. (2002a) 'Using Markov chains for link prediction in adaptive web sites', Proceedings of the 1st International Conference on Computing in an Imperfect World (Soft-Ware 2002), Belfast, Northern Ireland, Springer Verlag LNCS, April, pp.60–73.

Zhu, J., Hong, J. and Hughes, J.G. (2002b) 'Using Markov models for web site link prediction', Proceedings of the 13th ACM Conference on Hypertext and Hypermedia, USA, pp.169–170.

Zukerman, I., Albrecht, D.W. and Nicholson, A.E. (1999) 'Predicting users' requests on the WWW', Proceedings of the 7th International Conference on User Modeling, Canada, pp.275–284.

Notes

1 Processes that evolve over time in a probabilistic manner.
2 The observed characteristic or condition of the system at any given moment.
3 The system changes state directly in one time period.
4 http://www.webtrends.com
5 This website is restricted to a collection of HTML pages residing on a single server. Dynamically generated pages or multiple servers are not yet handled in this study.
6 In fact, this is not always true, because proxies caching two users' simultaneous requests are recorded in web logs as visits to the site from the same machine. Fortunately, such coincidences do not affect this experiment substantially because we are interested mainly in overall site usage data, not individual users' site usage.
7 The log-version-1.txt and log-version-2.txt files can be obtained by request from the first author.
8 Total number of different pages accessed within the session.
