
Author's personal copy

Expert Systems with Applications 36 (2009) 6635–6644


Expert Systems with Applications journal homepage: www.elsevier.com/locate/eswa

Creating meaningful data from web logs for improving the impressiveness of a website by using path analysis method

Resul Das a,*, Ibrahim Turkoglu b

a Department of Informatics, Firat University, 23119 Elazig, Turkey
b Department of Electronics and Computer Science, Firat University, 23119 Elazig, Turkey

Article info

Keywords: Web mining; Web usage mining; Web log files; Path analysis

Abstract

Web usage mining analyzes web log files to discover the patterns in which users access web pages. In order to manage and report on a website effectively, it is necessary to get feedback about activity on the web servers. The aim of this study is to help the web designer and web administrator improve the impressiveness of a website by determining the link connections that actually occur on the website. To this end, web log files are pre-processed and the path analysis technique is then used to investigate the URL information concerning access to electronic sources. The proposed methodology is applied to the web log files of the web server of Firat University. The results and findings of this experimental study can be used by the web designer to plan upgrades and enhancements to the website. © 2008 Elsevier Ltd. All rights reserved.

1. Introduction

With the explosive growth of knowledge sources available on the World Wide Web, it has become increasingly important to find useful information in these huge amounts of data. At the same time, the growth in the number of websites presents a challenging task for web designers: organizing the contents of websites to cater to the needs of web users. Solutions to these problems can be provided by path analysis using web user navigation patterns. In addition, web designers can improve the design and organization of websites based on the solutions obtained (Das, Turkoglu, & Poyraz, 2007; Etzioni, 1996; Gunduz, 2003; Kosala & Blockeel, 2000).

Etzioni is widely credited with coining the term web mining. Web mining is described as the use of data mining techniques to automatically discover and extract useful information from web documents and services (Etzioni, 1996). In general, web mining research can be classified into three categories: web content mining, web structure mining, and web usage mining (Kosala & Blockeel, 2000). While web structure and content mining utilize primary data on the web, web usage mining works on secondary data such as web server access logs, proxy server logs, referrer logs, browser logs, error logs, user profiles, registration data, user sessions or transactions, cookies, user queries, and bookmark data (Gunduz, 2003). By analyzing these log files and documents we can uncover interesting usage patterns and information.

In recent years, most research activities in web mining have centred on web usage mining. Web usage mining techniques have been widely used for the discovery of interesting usage patterns from web server log files (Das et al., 2007). Liu and Keselj (2007) presented a project aimed at the automatic classification of web user navigation patterns and proposed a novel approach to classifying user navigation patterns and predicting users’ future requests. Araya, Silva, and Weber (2004) provide a methodology based on suggestions from the literature and their own experience from various web mining projects; its application in a Chilean bank showed how the combined use of data from a data warehouse and web data can contribute to improving marketing activities. Srivastava, Cooley, Deshpande, and Tan (2000) note that, in web mining, data can be collected at the server side, the client side, proxy servers, or a consolidated web database. Soft computing methods (neural networks, fuzzy logic, genetic algorithms, rough sets, etc.) have been used intensively in web usage mining studies. Some of them are described in Tug, Sakiroglu, and Arslan (2006), Pal, Talwar, and Mitra (2002), Zaiane, Xin, and Han (1998) and Khasawneh and Chan (2005).

The paper is organized as follows. Section 2 describes background topics related to web usage mining and the path analysis method used in this study. The implementation of the proposed methodology for Firat University is described in detail in Section 3. The useful results obtained are presented in Section 4. Finally, the conclusion is given in Section 5.

* Corresponding author. E-mail address: rdas@firat.edu.tr (R. Das).
0957-4174/$ - see front matter © 2008 Elsevier Ltd. All rights reserved. doi:10.1016/j.eswa.2008.08.067

2. Background

2.1. Related works

To date, many studies have analyzed web access log files using web usage mining techniques. Xue et al.


R. Das, I. Turkoglu / Expert Systems with Applications 36 (2009) 6635–6644

propose a novel re-ranking method based on site logs; they derive each page’s access frequency and the traversal patterns of information finding from the web logs (Xue, Zeng, Chen, Ma, & Lu, 2002). Cooley et al. carried out in-depth research on the whole web usage mining procedure. In Cooley, Mobasher, and Srivastava (1999a, 1999b) they discuss methods to pre-process user log data and to separate web page references into those made for navigational purposes and those made for content purposes. They proposed a user browsing behaviour model which assumes that a given user’s treatment of each page is either for the purpose of ‘navigation’ or ‘actual content’, as determined by the page references and associated times obtained from web server logs (Cooley et al., 1999a). In addition, Cooley’s PhD thesis (Cooley, 2000) provides a comprehensive overview of the work in web usage data pre-processing. Spiliopoulou et al. mined path traversal patterns to facilitate better design and organization of web pages (Spiliopoulou & Pohle, 2001; Spiliopoulou, Pohle, & Faulstich, 2000). Azhar explores the use of web usage mining techniques to analyze web log records collected from an e-learning portal using the apriori algorithm (Azhar, 2005). Oosthuizen, Wesson, and Cilliers (2006) discuss and analyze web logs for visual web mining of organizational websites using data mining algorithms. Drott (1998) explains the various web server log mining methods that could be used to improve site design. Sarukkai (1999) discusses link prediction and path analysis for better user navigation and proposes a Markov chain model to predict the user access pattern based on previously collected user access logs. Zhu, Hong, and Hughes (2002) extend this by introducing the maximal forward reference to eliminate the effect of backward references by the user. They also predict user behaviour within the next n steps, using an n-step Markov chain as opposed to the one-step approach of Sarukkai. Information foraging theory concepts have also been used by Chi, Pirolli, Chen, and Pitkow (2001) to incorporate user behaviour into the existing content and link structure; they model user needs and user actions using the notion of information scent (Desikan & Srivastava, 2003).

2.2. Web usage mining

Web usage mining aims to reveal the knowledge hidden in the log files of one or more websites. The goal is to capture, model, and analyze the behavioural patterns and profiles of users interacting with a website (Liu, 2007; Wang & Liu, 2003). Web usage mining works on secondary web data such as web server access logs, proxy server logs, browser logs, user profiles, registration data, user sessions or transactions, cookies, user queries, bookmark data, mouse clicks and scrolls, and any other data generated by the interaction between users and the web (Das et al., 2007). The discovered patterns are usually represented as collections of pages, objects, or resources that are frequently accessed by groups of users with common needs or interests. The web usage mining process is shown in Fig. 1. The main application areas of web usage mining are shown in Fig. 2. These areas are personalization, system improvements, site

Fig. 2. Main application areas of web usage mining. [Diagram: web usage mining branches into personalization, system improvement, site modification, business intelligence, and usage characterization.]

modification, business intelligence and usage characterization. A detailed overview of these areas can be found in Das et al. (2007) and Cooley (2000).

2.3. Path analysis

Path analysis, a method for causal modelling, was first described by Wright (1921, 1934) as a means of determining the influence of independent factors on dependent factors (Sahinler & Gorgulu, 2000). The path model is usually depicted as a circle-and-arrow figure in which each circle represents a variable. Single arrows indicate causation between exogenous or intermediary variables and the dependent variables. Arrows also connect the error terms with their respective endogenous variables. Double arrows indicate correlation between pairs of exogenous variables. A path analysis is a hierarchical multiple regression analysis used to test the fit of the correlation matrix against two or more causal models (http://support.sas.com/documentation/ (last accessed: 20.01.2008)). Each path coefficient is calculated by Eq. (1); it expresses how strongly the output variable responds to a unit change in the input variable.

\[ P_{yx} = b\,\frac{S_x}{S_y} \qquad (1) \]

where $P_{yx}$ is the path coefficient, that is, the direct effect of a unit change of the input variable $x$ on the output variable $y$, and $b$ is the partial regression coefficient. The standard deviations $S_x$ and $S_y$ are calculated by the following equations:

\[ S_x = \sqrt{\frac{1}{n}\sum (x - \bar{x})^2} \qquad (2) \]

\[ S_y = \sqrt{\frac{1}{n}\sum (y - \bar{y})^2} \qquad (3) \]
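As a rough illustration of Eqs. (1)-(3), the sketch below computes a path coefficient for a single input variable in plain Python. The function names and the sample data are hypothetical and are not taken from the study; the regression slope is obtained by ordinary least squares, which is an assumption about how b would be estimated in the simplest one-input case.

```python
import math


def std_dev(values):
    """Standard deviation as in Eqs. (2)-(3): divides by n, not n - 1."""
    n = len(values)
    mean = sum(values) / n
    return math.sqrt(sum((v - mean) ** 2 for v in values) / n)


def regression_slope(x, y):
    """Least-squares slope b of y regressed on a single input x (assumption)."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    cov_xy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y)) / n
    var_x = sum((a - mean_x) ** 2 for a in x) / n
    return cov_xy / var_x


def path_coefficient(x, y):
    """Eq. (1): P_yx = b * S_x / S_y."""
    return regression_slope(x, y) * std_dev(x) / std_dev(y)


# Hypothetical sample data, for illustration only.
x = [1.0, 2.0, 3.0, 4.0, 5.0]
y = [2.1, 3.9, 6.2, 7.8, 10.1]
print(path_coefficient(x, y))
```

With a single input variable the path coefficient coincides with the Pearson correlation between x and y, so it always lies between -1 and 1; with several inputs, b would instead come from a multiple regression.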

Fig. 1. The web usage mining process. [Diagram: web logs → pre-processing → preprocessed data → pattern discovery → rules, patterns, and statistics → pattern analysis → interesting rules, patterns, and statistics.]

Two cases arise for the analyzed variables. In the first case there is no relation between the causal variables $X_1$ and $X_2$, as shown in Fig. 3. The effect of the causal variables on the output can then be evaluated by the regression Eq. (4). Partitioning the total variation of the output into the contributions of the causal variables gives Eq. (5), where $\sigma_y^2$ is the variance of $Y$, $\sigma_{x_1}^2$ and $\sigma_{x_2}^2$ are the variances of the causal variables, and $b_1$, $b_2$ are the partial regression coefficients. Dividing Eq. (5) by $\sigma_y^2$ gives Eq. (6). The quantities $b_i\,\sigma_{x_i}/\sigma_y$ in Eq. (6) are called normalized regression coefficients or path coefficients,

\[ Y = b_1 X_1 + b_2 X_2 \qquad (4) \]

\[ \sigma_y^2 = b_1^2\,\sigma_{x_1}^2 + b_2^2\,\sigma_{x_2}^2 \qquad (5) \]

\[ 1 = \left(b_1\,\frac{\sigma_{x_1}}{\sigma_y}\right)^2 + \left(b_2\,\frac{\sigma_{x_2}}{\sigma_y}\right)^2 \qquad (6) \]

Fig. 3. The relations between X1, X2 and Y. [Path diagram: X1 and X2 each point to Y, with path coefficients P_YX1 and P_YX2.]

In the second case the causal variables are related to one another, as shown in Fig. 4; in other words, the causal variables are not independent of each other, so the covariance effects must be considered. The regression equation for Fig. 4 is given in Eq. (7). When the covariance (Cov) effects between the causal variables are taken into account, the variance of $Y$ takes the form of Eq. (8), where $\sigma_e^2$ is the variance of the error term $e$, assumed uncorrelated with the causal variables. The correlation ($r$) between $X_1$ and $Y$ can be evaluated by Eq. (9), which is rewritten in terms of path coefficients in Eq. (10) (Sahinler & Gorgulu, 2000),

\[ Y = b_1 X_1 + b_2 X_2 + b_3 X_3 + e \qquad (7) \]

\[ \sigma_y^2 = b_1^2\sigma_{x_1}^2 + b_2^2\sigma_{x_2}^2 + b_3^2\sigma_{x_3}^2 + 2 b_1 b_2\,\mathrm{Cov}(X_1, X_2) + 2 b_1 b_3\,\mathrm{Cov}(X_1, X_3) + 2 b_2 b_3\,\mathrm{Cov}(X_2, X_3) + \sigma_e^2 \qquad (8) \]

\[ r(X_1, Y) = \frac{\mathrm{Cov}(X_1,\; b_1 X_1 + b_2 X_2 + b_3 X_3 + e)}{\sqrt{\mathrm{Var}(X_1)\,\mathrm{Var}(Y)}} \qquad (9) \]

\[ r(X_1, Y) = P_{YX_1} + r_{X_1 X_2} P_{YX_2} + r_{X_1 X_3} P_{YX_3} \qquad (10) \]

Fig. 4. The relations between X1, X2, X3 and Y. [Path diagram: X1, X2, X3 and the error term e point to Y; double arrows labelled r_x1x2, r_x1x3 and r_x2x3 connect the causal variables.]

In this study, the aim of path analysis is to determine the paths that visitors take as they navigate through a website. We had to pre-process the dataset by keeping only the off-campus users with user id information. Also, in order to create sequential information, we had to adjust the dataset by adding a sequence number variable so that it could be used in path analysis (Battioui, 2007).

3. Methodology and implementation

In this section, the web usage mining implementation is described in detail. We present the application of the proposed methodology to the analysis of web log files. In this study, we developed an expert system to assist the web designer and web administrator in improving their website by determining the link connections that occur on the website. First, we obtained the access log files recorded on the web server of Firat University. The log files were then analyzed with the proposed web usage mining methodology in SAS software 9.1 (SAS software licence number: 291468). We present an overview of the tasks for each step and discuss the challenges involved. Fig. 5 illustrates the overall data flow of the general system architecture, which consists of three main tasks for performing web usage mining: pre-processing, pattern discovery and pattern analysis. Cooley, Mobasher, and Srivastava (1997) give an excellent discussion of the entire web usage mining process. The proposed methodology is shown in Fig. 6.

3.1. Data collection and pre-processing

An important task in any web usage mining application is the creation of a suitable pre-processed usage data set. This process is usually complex and critical to the successful extraction of useful information from the log files in web usage mining. The purpose of pre-processing is to offer a structured, reliable and integrated data source for pattern

Fig. 5. A general architecture system for web usage mining. [Diagram: server log data and registration data pass through pre-processing (data cleaning, transaction identification, data integration, transformation) to pattern discovery (path analysis, association rules, sequential patterns, clusters and classification rules) and then to pattern analysis (OLAP/visualization tools, knowledge query mechanism, database query language, intelligent agents).]


Fig. 6. The algorithm scheme of the proposed methodology for web usage mining. [Diagram: clients 1 to n send HTTP requests over the Internet to the web and application server, which returns responses and writes log files (22 fields). Pre-processing (data collection, data cleaning, session identification, data fusion, transformation) yields cleaned log files with the 5 required fields: client IP address, session id, request URL, referrer URL, and session sequence. Pattern discovery applies the path analysis method. Pattern analysis produces funnel counts, the link graph, the path plot and report, the item plot and report, the statistics plot, and association rules, giving interesting rules, patterns, and statistics.]

discovery. Usually, several pre-processing tasks need to be done before web mining algorithms can be run on the web server logs (Liu & Keselj, 2007). In our study, this stage included data cleaning, transaction identification, session identification, data integration and transformation. These pre-processing tasks are the same for any web usage mining problem and are discussed by Cooley et al. (1997, 1999a) and Cooley (2000). We start with data acquisition from the web server and pre-processing in order to extract user navigation patterns from the web log files. The raw log files are cleansed, formatted, and then grouped into meaningful sessions before being used for web usage analysis.

3.1.1. Sources and the structure of web logs

The sources used in web usage mining are the web server access logs. These text files are produced automatically by the web server for each HTTP transaction. Each access to a web page is recorded in the access log file of the web server that hosts it. Because server settings differ, there are many types of web logs. Internet Information Server provides a number of different log file formats that log all requests to the web server. These are the log file formats that Internet Information Server supports:

• W3C extended log file format.
• NCSA log file format.
• Microsoft log file format.
• Logging to any ODBC data source.

In this application, the W3C extended log file format was used for the user access files. All the web log files for this study were acquired from the web server in the Computer Centre of Firat University. A typical raw access log entry is shown in Table 1. The available W3C extended log file fields are listed and described in Table 2.

3.1.2. Data cleaning

Data cleaning is the first step performed in the pre-processing stage of web usage mining. Not all entries in the raw logs are valid for pattern analysis; we only want to keep the entries that carry relevant information. The data cleaning step therefore eliminates the irrelevant entries from the access log files, which include:

• First, entries that have a status of “error” or “failure” should be removed.
• Second, access records that are generated automatically by search engine agents should be identified and removed from the access log.
• Third, requests for picture files associated with requests for particular pages. A user’s request to view a particular page often results in several log entries because the page includes graphics, whereas we are only interested in what the users explicitly request, which are usually textual files (Liu & Keselj, 2007). Log entries requesting files with extensions such as “jpg”, “jpeg”, “gif”, “ico”, and “avi” are therefore filtered out. Other types of eliminated requests include JavaScript files (.js), style sheet files (.css), etc.
• Last, entries with an unsuccessful HTTP status code. HTTP status codes indicate the success or failure of a requested event, and we only consider successful entries with HTTP status codes between 200 and 299 (Das et al., 2007).
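The cleaning rules listed above can be sketched in a few lines of Python. This is only a minimal illustration, not the SAS procedure used in the study: the dictionary keys, the robot markers and the sample entries are assumptions introduced for the example, and the extension list repeats the ones named in the text.

```python
# File extensions named in the text as irrelevant for pattern analysis.
IRRELEVANT_EXTENSIONS = (".jpg", ".jpeg", ".gif", ".ico", ".avi", ".js", ".css")

# Hypothetical substrings used to spot search engine agents (assumption).
ROBOT_MARKERS = ("bot", "crawler", "spider", "slurp")


def is_relevant(entry):
    """Return True if a parsed log entry should be kept for pattern analysis."""
    # Keep only successful requests (HTTP status 200-299).
    if not 200 <= entry["status"] <= 299:
        return False
    # Drop requests generated by search engine agents.
    agent = entry["user_agent"].lower()
    if any(marker in agent for marker in ROBOT_MARKERS):
        return False
    # Drop embedded resources such as images, scripts and style sheets.
    if entry["uri_stem"].lower().endswith(IRRELEVANT_EXTENSIONS):
        return False
    return True


def clean_log(entries):
    """Filter a list of parsed log entries down to the relevant ones."""
    return [entry for entry in entries if is_relevant(entry)]


# Hypothetical sample entries, for illustration only.
sample = [
    {"uri_stem": "/enformatik/default.asp", "status": 200, "user_agent": "Mozilla/4.0 (compatible; MSIE 7.0)"},
    {"uri_stem": "/images/logo.gif", "status": 200, "user_agent": "Mozilla/4.0 (compatible; MSIE 7.0)"},
    {"uri_stem": "/missing.asp", "status": 404, "user_agent": "Mozilla/4.0 (compatible; MSIE 7.0)"},
    {"uri_stem": "/default.asp", "status": 200, "user_agent": "Googlebot/2.1"},
    {"uri_stem": "/style/site.css", "status": 200, "user_agent": "Mozilla/4.0 (compatible; MSIE 7.0)"},
]
print([entry["uri_stem"] for entry in clean_log(sample)])
```

Matching robots by user-agent substring is a simplification; a production filter would more likely rely on a maintained list of agents or on requests for robots.txt.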

3.1.3. Transaction identification

The aim of transaction identification is to create meaningful clusters of references for each user. Cooley et al. (1999a) propose a general model for transaction identification. In their model, each user session is considered either as a single transaction consisting of many page references or as a set of many single-page reference transactions. After data cleaning, the log entries must be partitioned into logical clusters using one or a series of transaction identification modules. Such a module is either a merge module or a divide module. Both types of module take a transaction list and possibly some parameters as input, and output a transaction list, in the same format as the input, that has been operated on by the function in the module. Cooley et al. (1999a) present three heuristic methods for transaction identification.

3.1.4. Session identification

A session can be described as the group of activities performed by a user from the moment he enters the website to the moment he leaves it. Session identification is therefore the process of segmenting the access log of each user into individual access sessions.

Table 1
A web request from Firat’s web server log

2008-01-14 22:45:27 W3SVC1 ICME 192.168.4.2 GET /enformatik/default.asp - 80 - 88.231.58.180 HTTP/1.1 Mozilla/4.0+(compatible;+MSIE+7.0;+Windows+NT+5.1;+.NET+CLR+2.0.50727;+.NET+CLR+3.0.04506.30;+.NET+CLR+1.1.4322) - /rektorlugebaglibrm.asp web.firat.edu.tr 200 0 0 13720 608 78


Table 2
Description of W3C extended log file fields (http://www.microsoft.com/technet/prodtechnol/windowsserver2003/library/iis/ (last accessed: 20.01.2008))

| Field | Appears as | Description |
|---|---|---|
| Date | 2008-01-14 | The date on which the activity occurred |
| Time | 22:45:27 | The time, in coordinated universal time (UTC), at which the activity occurred |
| Service name and instance number | W3SVC1 | The Internet service name and instance number that was running on the client |
| Server name | ICME | The name of the server on which the log file entry was generated |
| Server IP address | 192.168.4.2 | The IP address of the server on which the log file entry was generated |
| Method | GET | The requested action (typically GET) |
| URI stem | /enformatik/default.asp | The target of the action |
| URI query | - | The query, if any, that the client was trying to perform. A Universal Resource Identifier (URI) query is necessary only for dynamic pages |
| Server port | 80 | The server port number that is configured for the service |
| User name | - | The name of the authenticated user who accessed your server. Anonymous users are indicated by a hyphen |
| Client IP address | 88.231.58.180 | The IP address of the client that made the request |
| Protocol version | HTTP/1.1 | The protocol version (HTTP or FTP) that the client used |
| User agent | Mozilla/4.0+(compatible;+MSIE+7.0;+Windows+NT+5.1;+.NET+CLR+2.0.50727;+.NET+CLR+3.0.04506.30;+.NET+CLR+1.1.4322) | The browser type and properties that the client used |
| Cookie | - | The content of the cookie sent or received, if any |
| Referrer | /rektorlugebaglibrm.asp | The site that the user last visited. This site provided a link to the current site |
| Host | web.firat.edu.tr | The host header name, if any |
| HTTP status | 200 | The HTTP status code |
| Protocol substatus | 0 | The sub-status error code |
| Win32 status | 0 | The Windows status code |
| Bytes sent | 13720 | The number of bytes that the server sent |
| Bytes received | 608 | The number of bytes that the server received |
| Time taken | 78 | The length of time that the action took, in milliseconds |
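To make the field layout of Tables 1 and 2 concrete, the following sketch splits the sample log line into its 22 fields. It assumes that fields are separated by single spaces and appear in exactly the order of Table 2; in a real W3C extended log the order is declared by a `#Fields:` directive at the top of the file, so the hard-coded list and the field names used here are illustrative assumptions only.

```python
# Field names in the order of Table 2 (the names themselves are hypothetical).
W3C_FIELDS = [
    "date", "time", "service_name", "server_name", "server_ip", "method",
    "uri_stem", "uri_query", "server_port", "user_name", "client_ip",
    "protocol_version", "user_agent", "cookie", "referrer", "host",
    "status", "substatus", "win32_status", "bytes_sent", "bytes_received",
    "time_taken",
]

NUMERIC_FIELDS = (
    "server_port", "status", "substatus", "win32_status",
    "bytes_sent", "bytes_received", "time_taken",
)


def parse_w3c_line(line):
    """Split one space-delimited log line into a dict keyed by field name."""
    values = line.split()
    if len(values) != len(W3C_FIELDS):
        raise ValueError(f"expected {len(W3C_FIELDS)} fields, got {len(values)}")
    entry = dict(zip(W3C_FIELDS, values))
    for name in NUMERIC_FIELDS:
        entry[name] = int(entry[name])
    # Spaces inside the user agent are logged as '+' characters.
    entry["user_agent"] = entry["user_agent"].replace("+", " ")
    return entry


# The log line of Table 1.
line = (
    "2008-01-14 22:45:27 W3SVC1 ICME 192.168.4.2 GET /enformatik/default.asp "
    "- 80 - 88.231.58.180 HTTP/1.1 "
    "Mozilla/4.0+(compatible;+MSIE+7.0;+Windows+NT+5.1;+.NET+CLR+2.0.50727;"
    "+.NET+CLR+3.0.04506.30;+.NET+CLR+1.1.4322) "
    "- /rektorlugebaglibrm.asp web.firat.edu.tr 200 0 0 13720 608 78"
)
entry = parse_w3c_line(line)
print(entry["client_ip"], entry["uri_stem"], entry["referrer"], entry["status"])
```

Of these fields, only the client IP address, request URL and referrer URL are carried forward, together with a session id and a session sequence number, into the cleaned log described in Fig. 6.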

The purpose of session identification is to group the page accesses of each user into individual access sessions. Cooley et al. (1999a) define a session as including the client IP address, the client user id, the URL of the accessed page and the time of access. Two time-oriented heuristics and a basic navigation-oriented heuristic are given below. Each heuristic h scans the user activity logs into which the web server log is partitioned.

h1: For the session-duration-based method, the total session duration may not exceed a threshold θ. Given t0, the timestamp of the first request in a constructed session S, a request with timestamp t is assigned to S iff t − t0 ≤ θ (Liu, 2007). Based on empirical findings, a 30-min threshold for total session duration has been recommended (Spiliopoulou, Mobasher, Berendt, & Nakagawa, 2003).

h2: For the page-stay-time-based method, the total time spent on a page may not exceed a threshold δ. Given t1, the timestamp of a request assigned to constructed session S, the next request with timestamp t2 is assigned to S iff t2 − t1 ≤ δ (Liu, 2007). Generally, a conservative threshold of 10 min for page-stay time has been proposed to capture the time for loading and studying the contents of a page (Spiliopoulou et al., 2003).

h-ref: For the referrer-based heuristic method, a request q is added to constructed session S if the referrer of q was previously invoked in S. Otherwise, q is used as the start of a new constructed session. With this method, a request q may potentially belong to more than one “open” constructed session, since its referrer may have been accessed previously in multiple sessions. In this case, additional information can be used for disambiguation (Spiliopoulou et al., 2003).

In our system, we applied the referrer-based heuristic (h-ref) for session identification because it led to better experimental results. The log entries were partitioned into logical clusters using one or a series of session identification modules. We imported the data and created four variables: referrer, session id, request file and session sequence. An example of the application of the h-ref heuristic, showing the first lines of the pre-processed logs from our dataset, is given in Table 3.
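The three heuristics can be sketched as follows for the time-ordered requests of a single user. This is a hedged illustration rather than the implementation used in the study: the request dictionaries, the function names and the sample timestamps are hypothetical, and the h-ref version resolves ambiguity by picking the most recently opened matching session, which is only one possible choice of "additional information".

```python
from datetime import datetime, timedelta


def sessions_h1(requests, theta=timedelta(minutes=30)):
    """h1: a request joins the current session iff t - t0 <= theta."""
    sessions = []
    for req in requests:
        if sessions and req["time"] - sessions[-1][0]["time"] <= theta:
            sessions[-1].append(req)
        else:
            sessions.append([req])
    return sessions


def sessions_h2(requests, delta=timedelta(minutes=10)):
    """h2: a request joins the current session iff t2 - t1 <= delta."""
    sessions = []
    for req in requests:
        if sessions and req["time"] - sessions[-1][-1]["time"] <= delta:
            sessions[-1].append(req)
        else:
            sessions.append([req])
    return sessions


def sessions_href(requests):
    """h-ref: a request joins a session in which its referrer was requested."""
    sessions = []
    for req in requests:
        target = None
        # Assumption: prefer the most recently opened matching session.
        for session in reversed(sessions):
            if any(prev["url"] == req["referrer"] for prev in session):
                target = session
                break
        if target is None:
            sessions.append([req])  # no referrer match: start a new session
        else:
            target.append(req)
    return sessions


def t(clock):
    """Helper building a timestamp on 2008-01-14 from 'HH:MM:SS'."""
    return datetime.strptime("2008-01-14 " + clock, "%Y-%m-%d %H:%M:%S")


# Hypothetical requests of one user, ordered by time.
requests = [
    {"time": t("08:38:17"), "url": "/default.asp", "referrer": "-"},
    {"time": t("08:38:40"), "url": "/?git=fakulteler", "referrer": "/default.asp"},
    {"time": t("08:39:10"), "url": "/eemuh/default.asp", "referrer": "/?git=fakulteler"},
    {"time": t("09:20:00"), "url": "/default.asp", "referrer": "-"},
    {"time": t("09:20:30"), "url": "/?git=enstituler", "referrer": "/default.asp"},
]

for session_number, session in enumerate(sessions_href(requests), start=1):
    for sequence, req in enumerate(session, start=1):
        print(session_number, sequence, req["referrer"], req["url"])
```

The printed session sequence numbers correspond to the Session_sequence variable that path analysis needs in order to treat each session as an ordered list of pages.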

3.2. Pattern discovery

Data mining techniques are used to extract usage patterns from web log files. Pattern discovery is the key process of web mining and draws on algorithms and techniques from several research areas, such as data mining, machine learning, statistics and pattern recognition. Techniques such as statistical analysis, association rules, clustering, classification, sequential patterns and dependency modelling are used to discover rules and patterns (Cooley et al., 1997; Cooley, 2000). In this phase, the path analysis method is applied to the pre-processed web log data files. Path analysis allows the user to determine the paths that visitors take as they navigate through a website. It also performs association analysis between web links and allows the user to extract sequential association rules among large sets of web links.

3.3. Pattern analysis

The final stage of web usage mining is pattern analysis, as shown in Fig. 6. The aim of this process is to extract the interesting rules, patterns or statistics from the output of the pattern discovery process by eliminating the irrelevant rules or statistics. The pattern analysis phase provides tools that facilitate the transformation of information into knowledge. Many web usage mining tools incorporate an SQL-like web mining language, which first applies objective criteria such as support and confidence. The results of our application are given in Section 4.

4. Extracted interesting rules and patterns

In this study, web server log files acquired from the computer centre of Firat University were used. The dataset was very large, so we had to choose only one file (2 days) out of these data. The file chosen contained URL information running from January 14, 2008


Table 3
An example of pre-processed web logs

| Referer | Session_id | Request_file | Session_sequence |
|---|---|---|---|
| - | 67e2a4da826149c5 2008-01-14 08:38:17 | /default.asp?id=7 | 1 |
| /default.asp | 67e2a4da826149c5 2008-01-14 08:38:17 | /?git=fakulteler | 2 |
| /?git=fakulteler | 67e2a4da826149c5 2008-01-14 08:38:17 | /eemuh/default.asp | 3 |
| /eemuh/default.asp | 67e2a4da826149c5 2008-01-14 08:38:17 | /eemuh/dersler_detay.asp | 4 |
| /eemuh/dersler_detay.asp | 67e2a4da826149c5 2008-01-14 08:38:17 | /eemuh/default.asp | 5 |
| /eemuh/default.asp | 67e2a4da826149c5 2008-01-14 08:38:17 | /eemuh/foto/igallery.asp | 6 |
| - | 67eb39ebce6c73c3 2008-01-14 23:43:23 | /default.asp | 1 |
| /default.asp | 67eb39ebce6c73c3 2008-01-14 23:43:23 | /?git=enstituler | 2 |
| /?git=enstituler | 67eb39ebce6c73c3 2008-01-14 23:43:23 | /fenbilimleri/index.htm | 3 |
| /fenbilimleri/index.htm | 67eb39ebce6c73c3 2008-01-14 23:43:23 | /fenbilimleri/dersler.htm | 4 |
| /fenbilimleri/dersler.htm | 67eb39ebce6c73c3 2008-01-14 23:43:23 | /fenbilimleri/computer.htm | 5 |
| - | 67cb8704bb370ee2 2008-01-14 13:44:04 | /default.asp | 1 |

Table 4
Funnel counts

| Rule id | Rule | Item1 | Item2 | Item3 | Item4 |
|---|---|---|---|---|---|
| 1 | /sosyalbil/sonsite/index.htm ==> /sosyalbil/sonsite/ilan.htm | 90 | 73 | . | . |
| 2 | /sosyalbil/sonsite/index.htm ==> /sosyalbil/sonsite/ilan.htm ==> /sosyalbil/sonsite/index.htm | 90 | 73 | 52 | . |
| 3 | /sosyalbil/sonsite/ilan.htm ==> /sosyalbil/sonsite/index.htm | 76 | 52 | . | . |
| 4 | /sosyalbil/sonsite/index.htm ==> /sosyalbil/sonsite/ilan.htm ==> /sosyalbil/sonsite/index.htm ==> /sosyalbil/sonsite/index.htm | 90 | 73 | 52 | 48 |
| 5 | /veteriner/akadamik_database/akademik_personel_ic.asp ==> /veteriner/akadamik_database/akedemikgoster.asp | 27 | 14 | . | . |
| 6 | /med/dahili/dahilitip.htm ==> /med/dahili/dahiliabd.htm | 25 | 23 | . | . |
| 7 | /sosyalbil/sonsite/ilan.htm ==> /sosyalbil/sonsite/index.htm ==> /sosyalbil/sonsite/index.htm ==> /sosyalbil/sonsite/ilan.htm | 76 | 52 | 48 | 11 |
| 8 | /med/cerrahi/cerrahitip.htm ==> /med/cerrahi/cerrahiabd.htm | 23 | 17 | . | . |
| 9 | /veteriner/akadamik_database/picture/inc_default.asp ==> /veteriner/akadamik_database/picture/cat.asp | 19 | 15 | . | . |
| 10 | /sosyalbil/sonsite/index.htm ==> /sosyalbil/sonsite/anabilim.htm | 90 | 15 | . | . |

until January 15, 2008 with 146,178 lines. SAS Enterprise Miner 5.2 was used to explore and analyze the data using the path analysis method. Because of the size of our dataset, the number of URLs in the path analysis was increased from the default of 100,000 to 1 million. The original dataset was first analyzed using path analysis. In order to obtain effective and interesting results, we then removed all image and style sheet requests, as they are a routine part of the initial access. These URLs were filtered out of the original dataset, and the results improved when the path analysis was re-run. We had to filter the data in several steps; each time we removed unnecessary URLs, the results improved.

When performing path analysis with web user logs, path completion does not always succeed. Path completion may fail, for example, if a visitor leaves the site and subsequently returns within the timeout period established by the session algorithm. Path completion may also fail if a visitor enters or travels within a website using browser bookmarks instead of the navigation links in the current web page (http://support.sas.com/documentation/ (last accessed: 20.01.2008)).

Many different types of graphs can be formed for performing path analysis, since a graph represents some relation defined on web pages. The most obvious is a graph representing the physical layout of a website, with web pages as nodes and hypertext links between pages as directed edges. Other graphs could be formed based on the types of web pages, with edges representing similarity between pages, or with edges that give the number of users that go from one page to another. Examples of useful information discovered through path analysis are given below.

4.1. Funnel counts

Funnel counts show the drop-off in the number of visitors along a particular path of interest (http://support.sas.com/documentation/ (last accessed: 20.01.2008)). It can be useful to see how visitor attrition occurs along a path, indicating points of interest such as where the biggest drop-offs are. The first ten items of the funnel counts are tabulated in Table 4. Fig. 7 shows a line plot of the funnel counts by item number from the path analysis.

Fig. 7. Funnel counts.

4.2. Link graph

In order to model the paths followed by the visitors of the website, we create a weighted transition graph. This link graph is created using the data residing in the web logs. Its nodes represent the web pages of the site, whereas the links between them represent the hyperlinks between the pages. These links carry weights, which represent the number of transitions from the “source” web page to the “target” web page.
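The funnel counts of Section 4.1 and the weighted transition graph described above can be illustrated with a short Python sketch. It is not the SAS Enterprise Miner computation: the session lists and shortened URLs are hypothetical, and counting a funnel step as the number of sessions that contain the path prefix as consecutive requests is an assumption about how such counts could be defined.

```python
from collections import Counter


def contains_path(session, path):
    """True if `path` occurs in `session` as a run of consecutive requests."""
    n = len(path)
    return any(session[i:i + n] == path for i in range(len(session) - n + 1))


def funnel_counts(sessions, path):
    """Number of sessions reaching each successive step of `path`."""
    return [
        sum(contains_path(session, path[:k]) for session in sessions)
        for k in range(1, len(path) + 1)
    ]


def link_graph(sessions):
    """Weighted transition graph: (source, target) -> number of transitions."""
    weights = Counter()
    for session in sessions:
        for source, target in zip(session, session[1:]):
            weights[(source, target)] += 1
    return weights


# Hypothetical sessions, each an ordered list of requested pages.
sessions = [
    ["/index.htm", "/ilan.htm", "/index.htm"],
    ["/index.htm", "/ilan.htm"],
    ["/index.htm", "/anabilim.htm"],
    ["/ilan.htm", "/index.htm"],
]

counts = funnel_counts(sessions, ["/index.htm", "/ilan.htm", "/index.htm"])
print(counts)
print(link_graph(sessions))
```

The ratio of two consecutive funnel counts gives the share of visitors who continue along the path; for rule 1 in Table 4, for example, 73/90 ≈ 0.811.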


The link graph includes a “Start” node as the starting point of the user’s visit to the website and an “Exit” node as the ending point of the visit. In order to ensure that there is a directed path between any two nodes in the link graph, we add a link from the “Exit” node to the “Start” node. Due to the influence of caching, the sum of the weights on all incoming links of a page might not equal the sum of the weights on all its outgoing links. To solve this problem, we can either assign the extra incoming weights to the link to the start/exit node or distribute the extra outgoing weights to the incoming links.

Fig. 8 shows the link graph constructed from a web log file of the Firat University website, in which the title of each page is shown beside the node representing it. The link graph is a graphical representation of the path analysis results held in the rules data set. Two data tables are used to produce it: a node data table and a link data table. As shown in Fig. 8, the link between /veteriner/yonetim.asp and /veteriner/akadamik_database/yonetim_ic.asp is thick, indicating a high confidence value. The link between /med/ogrenciler/ogrenciler.htm and /med/ogrenciler/1sinif.htm is thin, indicating a lower confidence value. Both of these rules have similar count and support values. By default, links whose confidence falls below a threshold of 20% are not displayed; we lowered the confidence threshold to 10%.

4.3. Path plot and path report

As shown in Table 5, the path (rule) with the highest count was /sosyalbil/sonsite/index.htm to /sosyalbil/sonsite/ilan.htm. This rule also had a high confidence, which indicates that there is an 81.11% chance that a visitor who clicks /sosyalbil/sonsite/index.htm will then click /sosyalbil/sonsite/ilan.htm. Fig. 9 displays a line plot depicting the visitor transition frequencies between different locations. The data set that underlies the path plot is the rules table.

4.4. Item report and item plot

Fig. 10 displays a scatter plot (statistics histogram) of the items in a rule. The plot is shaded by the rule support level. Hovering the mouse pointer over a marker on the plot displays the items in the rule and the

Fig. 8. Link graph.

Table 5
First lines of the path report

Chain size   Count   Support (%)   Confidence (%)   Rule
2            73      59.8361       81.1111          /sosyalbil/sonsite/index.htm ==> /sosyalbil/sonsite/ilan.htm
2            53      43.4426       58.8889          /sosyalbil/sonsite/index.htm ==> /sosyalbil/sonsite/index.htm
3            52      42.623        71.2329          /sosyalbil/sonsite/index.htm ==> /sosyalbil/sonsite/ilan.htm ==> /sosyalbil/sonsite/index.htm
4            48      39.3443       92.3077          /sosyalbil/sonsite/index.htm ==> /sosyalbil/sonsite/ilan.htm ==> /sosyalbil/sonsite/index.htm ==> /sosyalbil/sonsite/index.htm
2            23      18.8525       92               /med/dahili/dahilitip.htm ==> /med/dahili/dahiliabd.htm
2            19      15.5738       73.0769          /med/temel/temeltip.htm ==> /med/temel/temelabd.htm
2            17      13.9344       73.913           /med/cerrahi/cerrahitip.htm ==> /med/cerrahi/cerrahiabd.htm
2            15      12.2951       78.9474          /veteriner/akadamik_database/picture/inc_default.asp ==> /veteriner/akadamik_database/picture/cat.asp
2            14      11.4754       73.6842          /sosyalbil/sonsite/anabilim.htm ==> /sosyalbil/sonsite/index.htm
2            14      11.4754       51.8519          /veteriner/akadamik_database/akademik_personel_ic.asp ==> /veteriner/akadamik_database/akedemikgoster.asp
2            12      9.8361        46.1538          /med/ogrenciler/ogrenciler.htm ==> /med/ogrenciler/1sinif.htm


Fig. 9. Path plot.

Table 6
Items report

Target item                    Transaction count   Transaction support (%)
/eemuh/default.asp             61                  50
/eemuh/haber_detay_ic.asp      27                  22.1311
/elkegitimi/ilanlar.asp        22                  18.0328
/iletisim/default.asp          19                  15.5738
/elkbilgi/lisansindex.asp      18                  14.7541
/eemuh/foto/igallery.asp       18                  14.7541
/fenbilimleri/duyurular.htm    16                  13.1148
/cakaro/index.htm              15                  12.2951
/dkonservatuar/talebeler.htm   13                  10.6557
/eemuh/dersler_detay.asp       12                  9.8361
/elkbilgi/dindex.asp           11                  9.0164
/fenbilimleri/index.htm        10                  8.1967
/fenbilimleri/genel.htm        10                  8.1967

Fig. 10. Item plot.

support level as shown in Fig. 10. The higher the support of a small square, the darker its colour. Notice that most of the small squares form a straight line, which reflects the large amount of noise in the website data. The items report shows the number of times each item occurred in the dataset. The first lines of the items report, which have the highest counts, are given in Table 6. The item with the highest count is /eemuh/default.asp. This is the first page that a user views when logging in to the Firat University website.

4.5. Statistics plot

The statistics plot and the items plot are very helpful for understanding how the rules obtained from the analysis are distributed with respect to their support and confidence values. The statistics plot is given in Fig. 11. Every small square in this graph represents one association rule with two items, an antecedent and a consequent. Each rule is identified by its support and confidence values.

Fig. 11. Statistics plot.

Table 7 Rules table

4.6. Rules table

Table 7 lists the data associated with the graph. The column headings of the rules table display the variable labels used in the SAS code.
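The support and confidence values that identify each rule appear to follow the usual association-rule definitions: support is the share of all transactions that contain the rule, and confidence is the share of transactions containing the antecedent that also contain the consequent. The sketch below recomputes the first row of Table 5 under the assumption that the data set holds 122 transactions and that the antecedent /sosyalbil/sonsite/index.htm occurs in 90 of them. Neither total is stated in the text; both are assumed here only because they reproduce the reported values, and the function name rule_metrics is hypothetical.

```python
def rule_metrics(rule_count, antecedent_count, total_transactions):
    """Return (support, confidence) of a rule, both in percent.

    support    = rule_count / total_transactions
    confidence = rule_count / antecedent_count
    """
    if antecedent_count <= 0 or total_transactions <= 0:
        raise ValueError("counts must be positive")
    support = 100.0 * rule_count / total_transactions
    confidence = 100.0 * rule_count / antecedent_count
    return support, confidence


# First rule of Table 5: count = 73 (from the table).
# ASSUMED totals, not given in the paper: 122 transactions overall,
# 90 transactions containing the antecedent page.
support, confidence = rule_metrics(73, 90, 122)
print(round(support, 4), round(confidence, 4))
```

With these assumed totals the result agrees with the 59.8361 and 81.1111 reported for that rule, and the same total of 122 is also consistent with the transaction supports in Table 6 (for example, 61 out of 122 gives 50%).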

5. Conclusion

Websites are among the most important international advertising tools for universities and other institutions. Therefore, the content and design of web pages are very significant for web designers. Analyzing web user access log files can help in understanding user behaviour and web structure, thereby improving the design of web components and web applications (Velezquez, Yasuda, Aoki, & Weber, 2004). The quality of a website can be evaluated by analyzing user accesses to the website. Many factors may affect the quality of a website, such as content, presentation, ease of use, user waiting time, and so on. Web usage mining results can be used to improve website design and increase user satisfaction. Web analysts therefore have to analyze the user access log files of their web server to determine system errors, access sizes, associations between pages, and link connections in order to increase the performance and effectiveness of their web pages.

The aim of this study is to help the web designer and web administrator improve their website by determining the link connections that occur in the website. To this end, raw log files were pre-processed and the path analysis technique was used to investigate the URL information in the web log files concerning access to electronic sources. The proposed methodology was applied to the user access log files on the web server of Firat University. The results and findings of this experimental study can be used by the web administrator and web designer to plan upgrades and enhancements to the website.

Applying path analysis to web logs provides a count of the number of times each link occurred in the dataset and a list of association rules. The link graph is very easy to interpret. It contains association rules that are very helpful in understanding the paths that users take as they log in through the university website. These results can be used to better organize the Firat University website. It is also possible to take this study much further by investigating the web log data on a continuing basis.

Web usage data can be used to better understand web usage and to apply this knowledge to better serve users. More research using web usage mining needs to be done in e-commerce, bioinformatics, computer security, web intelligence, intelligent learning, database systems, finance, marketing, healthcare and telecommunications.

Acknowledgements

This study is supported by the Scientific Research Projects Unit of Firat University (Project No. 1526). The authors would like to thank the Computer Centre of Firat University for providing the web server log files. In addition, they would like to thank Prof. Dr. Mustafa Poyraz for providing the necessary opportunity for this study.

References

Araya, S., Silva, M., & Weber, R. (2004). A methodology for web usage mining and its applications to target group identification. Fuzzy Sets and Systems, 148, 139–152.
Azhar, A. (2005). Web usage mining using appriori algorithm: UUM-learning care portal case. In Proceeding at ICKM 05, UPM.
Battioui, C. (2007). Data mining techniques to analyze a library database. SAS Institute Inc. Paper 076-31.
Chi, E. H., Pirolli, P., Chen, K., & Pitkow, J. (2001). Using information scent to model user information needs and actions on the web. In Proceedings of ACM CHI 2001 conference on human factors in computing systems, April (pp. 490–497). Seattle, WA: ACM Press.
Cooley, R. (2000). Web usage mining: Discovery and application of interesting patterns from web data. PhD thesis, University of Minnesota.
Cooley, R., Mobasher, B., & Srivastava, J. (1997). Web mining: Information and pattern discovery on the world wide web. In Proceedings of the 9th IEEE international conference on tools with artificial intelligence (ICTAI’97), USA (pp. 558–567).
Cooley, R., Mobasher, B., & Srivastava, J. (1999a). Data preparation for mining world wide web browsing patterns. Journal of Knowledge and Information Systems, 1(1), 1–27.
Cooley, R., Mobasher, B., & Srivastava, J. (1999b). Grouping Web page references into transactions for mining world wide web browsing patterns. Journal of Knowledge and Information Systems, 1(10), 1–13.
Das, R., Turkoglu, I., & Poyraz, M. (2007). Analyzing of system errors for increasing a web server performance by using web usage mining. Istanbul University – Journal of Electrical and Electronics Engineering (IU-JEEE), 7(2), 379–386. Istanbul.
Desikan, P., & Srivastava, J. (2003). Mining information from temporal behaviour of web usage. AHPCRC technical report TR-2003-121.
Drott, M. (1998). Using web server logs to improve site design. In Proceedings of the ACM conference on computer documentation (pp. 43–50).
Etzioni, O. (1996). The world wide web: Quagmire or gold mine. Communications of the ACM, 39(11), 65–68.
Gunduz, S. (2003). Recommendation models for web users: User interest model and click-stream tree. PhD thesis, Institute of Science and Technology, Istanbul Technical University, Turkey.
Khasawneh, N., & Chan, C. C. (2005). Web usage mining using rough sets. In IEEE annual meeting of the north american fuzzy information processing society – (NAFIPS’05).


Kosala, R., & Blockeel, H. (2000). Web mining research: A survey. SIGKDD Explorations: Newsletter of the special interest group (SIG) on knowledge discovery and data mining, ACM, 2(1), 1–15.
Liu, B. (2007). Web data mining: Exploring hyperlinks, contents and usage data. Springer. ISBN: 13-978-3-540-37881-5 (532p.).
Liu, H., & Keselj, V. (2007). Combined mining of web server logs and web contents for classifying user navigation patterns and predicting users’ future requests. Data and Knowledge Engineering, 61(2), 304–330.
Oosthuizen, C., Wesson, J., & Cilliers, C. (2006). Visual web mining of organizational websites. In Proceedings of the information visualization (IV’06). IEEE Computer Society.
Pal, S., Talwar, V., & Mitra, P. (2002). Web mining in soft computing framework: Relevance, state of the art and future directions. IEEE Transactions on Neural Networks, 13(5), 1163–1177.
Sahinler, S., & Gorgulu, O. (2000). Path analysis and an application. M.K. University, Journal of Agriculture Faculty, 5(1–2), 87–102.
Sarukkai, R. R. (1999). Link prediction and path analysis using Markov chains. In The proceedings of the 9th world wide web conference.
Spiliopoulou, M., Mobasher, B., Berendt, B., & Nakagawa, M. (2003). A framework for evaluation of session reconstruction heuristic in web usage analysis. INFORMS Journal on Computing, 15(2), 171–190.
Spiliopoulou, M., & Pohle, C. (2001). Data mining for measuring and improving the success of websites. Journal of Data Mining and Knowledge Discovery, 5(2), 85–114.
Spiliopoulou, M., Pohle, C., & Faulstich, L. (2000). Improving the effectiveness of a website with web usage mining. Lecture notes in computer science. Berlin: Springer-Verlag. pp. 142–162.
Srivastava, J., Cooley, R., Deshpande, M., & Tan, P. (2000). Web usage mining: Discovery and applications of usage patterns from web data. SIGKDD Explorations, 1(2), 12–23.
Tug, E., Sakiroglu, A. M., & Arslan, A. (2006). Automatic discovery of the sequential accesses from web log data files via a genetic algorithm. Knowledge-Based Systems, 19, 180–186.
Velezquez, J. D., Yasuda, H., Aoki, T., & Weber, R. (2004). A new similarity measure to understand visitor behaviour in the website. IEICE Transaction on Information Systems, E87-D(2), 389–396.
Wang, B., & Liu, Z. (2003). Web mining research. In Fifth international conference on computational intelligence and multimedia applications (ICCIMA’03). IEEE Computer Society.
Xue, G.-R., Zeng, H.-J., Chen, Z., Ma, W.-Y., & Lu, C.-J. (2002). Log mining to improve the performance of site search. In Proceedings of the third international conference on web information systems engineering (workshops) – (WISEw’02) (p. 238).
Zaiane, O. R., Xin, M., & Han, J. (1998). Discovering web access patterns and trends by applying OLAP and data mining technology on web logs. In Proceedings of the advances in digital libraries conference (ADL’98), Santa Barbara, CA.
Zhu, J., Hong, J., & Hughes, J. G. (2002). Using Markov chains for link prediction in adaptive websites. In Proceedings of ACM SIGWEB hypertext.
